Preface


This book has been prepared for use in the second-year course in Mathematical Stat- 
istics at the University of Copenhagen, as a part of the curriculum for a B.Sc. in 
Mathematics. It gives an exposition of fundamental elements of Mathematical Stat- 
istics and parametric inference, chiefly based on the method of maximum likelihood 
and focusing on exponential families, which in particular enables establishing the 
asymptotic properties of the method of maximum likelihood with some rigour. 

The general structure and contents have been inspired by Hansen (2012) who fol- 
lows on from more than 50 years of teaching tradition in Copenhagen, notably based 
on lecture notes by Steen Andersson and Søren Tolver Jensen. These again build on
the first comprehensive course in Mathematical Statistics created by Hans Brøns in
the 1960s, inspired by Hald (1952) and refined over the years by many colleagues. 

A key concept in this tradition is the focus on the statistical model and its proper- 
ties, rather than the design of various ad hoc methods; the model identifies a frame- 
work used to reason about the problems considered, and the methods are derived 
from general principles, here mostly the method of maximum likelihood. An import- 
ant feature has been to give a geometric treatment of the multivariate normal distri- 
bution and associated linear normal models on a finite-dimensional vector space. 

I have also been inspired by material prepared by Helle Sørensen who has been
teaching the course before me. Draft editions of the book have been used for teaching 
this course by Niels Richard Hansen, Anders Tolver, and myself, and I am grateful for 
detailed and constructive comments received from Anders Tolver and Niels Richard 
Hansen at various stages in the process of writing this text. 

The book assumes that a course in Probability based on Measure Theory is taught 
in parallel or prior to this course and that the students have a good background in 
Mathematical Analysis and Linear Algebra. Some important results from these areas 
are briefly described in appendices. 

It is useful and probably necessary that the students have previously been acquainted
with general statistical thinking and simple statistical methods, at least up to
the level of simple linear regression in the normal distribution.


Copenhagen, July 2022. Steffen Lauritzen 


Taylor & Francis 
Taylor & Francis Group 


http://taylorandfrancis.com 


List of Figures 


Poisson likelihood 
Uniform likelihood 
Simulated Poisson likelihood 


Contour ellipses 
Visualization of the decomposition theorem 


Curved exponential family 
Mean on semi-circle 
Fixed coefficient of variation 


Estimation of range of uniform distribution 
Estimating mean of exponential distribution 

Least absolute deviations for exponential distribution 
Moment estimators in the normal distribution 


Visualization of maximum-likelihood estimation 

Estimation of mean on semi-circle 

Estimation for constant coefficient of variation 

Visualization of maximum-likelihood estimation in nested models 


Simulation for confidence intervals 

Numerical determination of confidence intervals 
Confidence ellipse for parameters of gamma distribution 
Confidence region for parameters of gamma distribution 


Power of multiple choice examination 


Exact test 




List of Tables 


Empirical coverage of confidence intervals 
Starch content in potatoes 


Weldon’s dice 

Weight of 300 trout 

Aspirin and myocardial infarction 

Contingency table for comparing multinomial distributions 

Hair and eye colour 

Standardized residuals for hair and eye colour 

Covid 19 infections 

Residuals for multiplicative Poisson model and Covid 19 infections 
Population size in local areas 

Residuals for shifted multiplicative Poisson model and Covid 19 
Expected Covid 19 infections 

Confidence intervals for Covid 19 infection parameters 
Smoking and myocardial infarction 




Chapter 1 


Statistical Models 


1.1 Models and parametrizations 


The basic object in mathematical statistics is a statistical model defined formally as 
follows: 


Definition 1.1. A statistical model consists of a measurable space $(\mathcal{X}, \mathbb{E})$ and a family
$\mathcal{P}$ of probability measures on $(\mathcal{X}, \mathbb{E})$. The space $(\mathcal{X}, \mathbb{E})$ is the representation space
of the model.


The interpretation of a statistical model is that data x is an observed outcome 
$X = x$ of a random variable $X$ with values in $\mathcal{X}$ and distribution given by an unknown
probability measure $P \in \mathcal{P}$. The task of the statistician is to say something
sensible about P, based on the observation x. This process is known as statistical 
inference. In the machine learning literature, it is more common to refer to this task 
as that of learning P from data x. 

This formulation of the basic statistical problem and associated inference task 
was introduced by Sir Ronald Aylmer Fisher (1890-1962) in Fisher (1922), and we 
shall sometimes refer to such a statistical model as a Fisherian statistical model. An 
alternative approach to statistical inference is based on a Bayesian statistical model, 
but we shall only briefly touch on this approach in Section 6.6. 

The reader should be aware that the notion of a statistical model as used here 
is quite different from the notion of a model as used by most scientists. A scientist
would typically think of the specific $P$ as the model, rather than the family $\mathcal{P}$. This
is a source of considerable confusion, also among statisticians who sometimes use 
the term ‘true model’ for the specific P. We shall avoid this term in the subsequent 
developments. 

The role of a statistical model is primarily to specify the framework within which 
our reasoning about the problem takes place. Sometimes the model has the purpose 
of giving a good description of reality, but other times—and quite often—it is rather 
used as a basis for saying something interesting about reality, for example that certain 
aspects of the observations do not conform with the model used, thereby providing a 
basis for further understanding of the reality behind the observations. 

It is convenient that the family $\mathcal{P}$ is parametrized, i.e. that we have a surjective
map $\nu : \Theta \to \mathcal{P}$ such that

$$\mathcal{P} = \{\nu(\theta) \mid \theta \in \Theta\} = \{P_\theta \mid \theta \in \Theta\},$$

where we traditionally write $P_\theta$ instead of $\nu(\theta)$. Such a map is a parametrization
of the statistical model, the space $\Theta$ is the parameter space, and $\theta$ is a parameter.
Sometimes the parametrization is just a technical device to label the measures in $\mathcal{P}$,
but the parameter $\theta$ might be a quantity of independent interest in the scientific domain
where data $x$ have been collected to determine $\theta$.

As we shall discuss in detail later, statistical inference can take many shapes. 
Methods of estimation are concerned with systematic attempts at guessing the value
of $\theta$ that is behind the observation $x$ (see Chapter 4); set estimation generates a subset
$C(x) \subseteq \Theta$ as a function of $x$ within which the unknown parameter $\theta$ may be; this is
discussed in Chapter 6. Methods for hypothesis testing, described in Chapter 7, are
concerned with assessing whether the observation $x$ conforms with a value $\theta \in \Theta_0$,
where $\Theta_0$ is a pre-specified subset of parameter values in $\Theta$. Clearly, these activities
are intimately related. 

Generally the parameter space $\Theta$ can be any set, but in many cases $\Theta$ is a well-behaved
subset of $\mathbb{R}^k$ with non-empty interior, in which case it is common to say
that the family or model is parametric and has dimension $k$. Also, we would mostly
assume that the map $\theta \mapsto P_\theta$ is injective, so that $\theta_1 \neq \theta_2 \Rightarrow P_{\theta_1} \neq P_{\theta_2}$, and often
we shall assume or exploit that the parametrization is a ‘nice’ function of $\theta$ in some
specified way.

If there is no simple parametrization with $\Theta \subseteq \mathbb{R}^k$, we say the model is non-parametric.
An example of a non-parametric model could be the family $\mathcal{P}_2$ of probability
measures on $(\mathcal{X}, \mathbb{E}) = (\mathbb{R}, \mathcal{B})$ with finite second moments, i.e.

$$P \in \mathcal{P}_2 \iff \int x^2 \, P(dx) < \infty.$$


This book has its focus on parametric statistical models although we from time to 
time give examples of non-parametric ones. 


1.1.1 Examples of statistical models 


Here we shall briefly mention a number of common and useful statistical models. 
We shall focus on their mathematical specification rather than their application to 
specific problems, and many of them will reappear from time to time in later sections 
and chapters. 


Example 1.2. [The simple Poisson model] In this model we consider a random variable
$X$ taking values in $\mathcal{X} = \mathbb{N}_0 = \{0, 1, 2, \ldots\}$, and the family of probability
measures is $\mathcal{P} = \{P_\lambda \mid \lambda \in \Lambda\}$, where $P_\lambda$ is the Poisson distribution

$$P_\lambda(X = x) = \frac{\lambda^x}{x!} e^{-\lambda}, \quad x \in \mathbb{N}_0. \tag{1.1}$$

We must also specify the possible values of $\lambda$ and use $\Lambda = \mathbb{R}_+ = (0, \infty)$ as parameter
space. This model has representation space $(\mathbb{N}_0, \mathfrak{P}(\mathbb{N}_0))$, where $\mathfrak{P}(A)$ denotes
the power set of $A$, i.e. the set of all subsets of $A$.

A variant of this model appears when we have independent and identically distributed
observations $X_1, \ldots, X_n$ from such a model. This would lead to a model
with representation space $(\mathbb{N}_0^n, \mathfrak{P}(\mathbb{N}_0^n))$, parameter space $\Lambda = \mathbb{R}_+$, and the family of
probability measures $\mathcal{P} = \{P_\lambda^{\otimes n} \mid \lambda \in \Lambda\}$.

Note that the parameter space $\Lambda$ is the same for the two variants of the simple
Poisson model, whereas the representation space and the family of probability measures
are obtained as $n$-fold products of the entities for the model corresponding to a
single observation.

The simple Poisson model is often used to describe the number of events hap- 
pening in a time period or in a specific region; a notable example is the number of 
$\alpha$-particles emitted in a time interval from radioactive materials where this model is
known to give a very accurate description of the phenomenon. 

In principle we could modify the Poisson models above by adding $\lambda = 0$ to
the parameter space, allowing for the possibility that X = 0 with probability one. 
However, this will typically add complications to other aspects of the analysis of the 
model so this is not normally done. 


We shall often consider the situation where we have independent and identically 
distributed observations $X_1 = x_1, \ldots, X_n = x_n$ from a given distribution $P$. In such
cases we also say that we have a sample $x = (x_1, \ldots, x_n)$ or $X = (X_1, \ldots, X_n)$ of
size $n$ from $P$, depending on whether or not we are emphasizing the randomness of
the values. Our next example is related to the Poisson model: 


Example 1.3. [The exponential distribution] Consider a sample $(X_1, \ldots, X_n)$ from
an exponential distribution where $X_i$ has density

$$f_\mu(x) = \frac{e^{-x/\mu}}{\mu}, \quad x > 0,$$

with respect to Lebesgue measure on $\mathbb{R}_+$. The parameter space is $M = \mathbb{R}_+$, the
representation space is $(\mathbb{R}_+^n, \mathcal{B}(\mathbb{R}_+^n))$, where $\mathcal{B}(\mathbb{R}_+^n)$ are the Borel sets of $\mathbb{R}_+^n$, and
the family of probability distributions is $\mathcal{P} = \{P_\mu^{\otimes n} \mid \mu \in M\}$, where $P_\mu$ is the
distribution with density $f_\mu$ as above.

The exponential distribution is closely associated with the Poisson distribution as
$P_\mu$ is the distribution of the waiting time to an event, where the number of events in a
time interval of length $t$ is Poisson distributed with parameter $\lambda = t/\mu$. The quantity
$1/\mu$ is also known as the rate of the exponential distribution.
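
The link between exponential waiting times and Poisson counts can be checked by simulation. The following is only an illustrative sketch: the function name `count_events`, the seed, and the values of the mean and interval length are assumptions made for this example and do not come from the text. It accumulates exponential waiting times with mean `mu` and counts the events falling in an interval of length `t`; for a Poisson count both the average and the variance should be close to t/mu.

```python
import random

def count_events(mu, t, rng):
    """Count events in (0, t] when waiting times are exponential with mean mu."""
    elapsed = rng.expovariate(1.0 / mu)  # expovariate expects the rate 1/mu
    count = 0
    while elapsed <= t:
        count += 1
        elapsed += rng.expovariate(1.0 / mu)
    return count

rng = random.Random(1)   # arbitrary fixed seed, for reproducibility only
mu, t = 0.5, 2.0         # hypothetical values, so that lambda = t / mu = 4

counts = [count_events(mu, t, rng) for _ in range(20000)]
mean_count = sum(counts) / len(counts)
var_count = sum((c - mean_count) ** 2 for c in counts) / len(counts)

# For a Poisson distribution the mean and the variance both equal lambda.
print(mean_count, var_count)
```

Agreement of the first two moments is of course not a proof of the distributional claim, only a quick sanity check.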


The next model is the basic model for tossing a coin repeatedly. 


Example 1.4. [The simple Bernoulli model] Here we consider $n$ independent and
identically distributed random variables with

$$P_\mu(X_i = 1) = 1 - P_\mu(X_i = 0) = \mu,$$

where $0 < \mu < 1$. The representation space is $(\{0, 1\}^n, \mathfrak{P}(\{0, 1\}^n))$, with $\mathfrak{P}$ denoting
the power set, the parameter space is the open interval $M = (0, 1)$, and the family of
probability measures is $\mathcal{P} = \{P_\mu^{\otimes n} \mid \mu \in M\}$. As in the Poisson model, we have in
principle the possibility of adding the endpoints of the interval $(0, 1)$ to the parameter
space, but this is usually not done.


A classic is the simple normal model, corresponding to measuring a quantity
$\xi \in \mathbb{R}$ with variance $\sigma^2 \in \mathbb{R}_+$.


Example 1.5. [Simple normal model] We consider independent and identically distributed
random variables $X_1, \ldots, X_n$ with $X_i \sim N(\xi, \sigma^2)$, i.e. the distribution of
$X_i$ has density

$$f_{(\xi, \sigma^2)}(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \xi)^2}{2\sigma^2}\right\}, \quad x \in \mathbb{R},$$

with respect to standard Lebesgue measure on $\mathbb{R}$, and $\omega = (\xi, \sigma^2) \in \mathbb{R} \times \mathbb{R}_+ = \Omega$.
Here the representation space is $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, and the parameter space is $\Omega$.
Submodels of this model appear, for example, when either $\xi$ or $\sigma^2$ is assumed to be
fixed and known.


The next class of models is useful for describing positive random quantities. 


Example 1.6. [Gamma models] We consider $X_1, \ldots, X_n$ independent and identically
distributed with $X_i$ having density

$$f_\theta(x) = \frac{x^{\alpha - 1}}{\beta^\alpha \Gamma(\alpha)} e^{-x/\beta}, \quad x > 0, \tag{1.2}$$

with respect to standard Lebesgue measure on $\mathbb{R}_+$. We let $\theta = (\alpha, \beta)$ and the parameter
space be $\Theta = \mathbb{R}_+^2$; the representation space is $(\mathbb{R}_+^n, \mathcal{B}(\mathbb{R}_+^n))$; and the family is
$\mathcal{P} = \{P_\theta^{\otimes n} \mid \theta \in \Theta\}$, where $P_\theta = P_{(\alpha, \beta)}$ has density as in (1.2).

We note that this gamma model has a number of interesting submodels, obtained
by restricting the parameter space appropriately:


a) The exponential distribution model discussed in Example 1.3, obtained by modifying
the parameter space to $\Theta_1 = \{(\alpha, \beta) \in \Theta \mid \alpha = 1\}$.

b) The gamma model with fixed shape parameter obtained by modifying the parameter
space to $\Theta_2 = \{(\alpha, \beta) \in \Theta \mid \alpha = \alpha_0\}$ for fixed $\alpha_0 \in \mathbb{R}_+$.

c) The gamma model with fixed scale parameter obtained by modifying the parameter
space to $\Theta_3 = \{(\alpha, \beta) \in \Theta \mid \beta = \beta_0\}$ for fixed $\beta_0 \in \mathbb{R}_+$.

d) The gamma model with fixed mean obtained by modifying the parameter space
to $\Theta_4 = \{(\alpha, \beta) \in \Theta \mid \alpha\beta = \mu_0\}$ for some fixed $\mu_0 \in \mathbb{R}_+$.


Clearly this list is far from exhaustive. 


Although this book focuses on parametric models, we shall give an example of a 
non-parametric model, here distributions with decreasing densities. 


Example 1.7. [Decreasing density] This model is often used for modelling waiting
times until the next of a series of recurrent events occurs. We assume that $X_1, \ldots, X_n$
are independent and identically distributed positive random variables with $X_i$ having
a non-increasing density $f$ with respect to Lebesgue measure on $\mathbb{R}_+$, i.e. the density
satisfies

$$\forall x, y \in \mathbb{R}_+ : \quad x > y \implies f(x) \leq f(y). \tag{1.3}$$

The representation space is again $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$, the family of probability measures
consists of those with density in the parameter space $\mathcal{F}$, where

$$\mathcal{F} = \{f : \mathbb{R}_+ \to [0, \infty) \mid f \text{ satisfies (1.3)}\}.$$

We note again the slightly weird terminology that this is a non-parametric model
even though the parameter space is huge. It simply indicates that there is no simple 
parametrization of the model with a finite-dimensional parameter space. 


The next two models are useful for illustrating various results that shall appear 
later. 


Example 1.8. [The simple uniform model] Here we consider a random variable $X$
which is uniformly distributed on the interval $(0, \theta)$ where $\theta \in \Theta = \mathbb{R}_+$. That is, the
density of $X$ is

$$f_\theta(x) = \frac{1}{\theta} 1_{(0, \theta)}(x)$$

with respect to Lebesgue measure on $\mathbb{R}_+$. The representation space is $(\mathbb{R}_+, \mathcal{B}(\mathbb{R}_+))$,
the parameter space is $\Theta = \mathbb{R}_+$, and the family is $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$, where $P_\theta$ has
density $f_\theta$ as above. We refrain from an explicit description of the variant with $n$
repeated observations $X_1, \ldots, X_n$.


And finally, we conclude with yet another model that mostly appears for illustra- 
tion of some of the theoretical concepts we shall consider later. 


Example 1.9. [The Cauchy model] This model is, for example, used to describe the
distribution of hitting points along the $x$-axis from a source placed at $\theta = (\alpha, \beta)$ that
emits particles uniformly in all directions. Here $X$ has density

$$f_\theta(x) = \frac{\beta}{\pi\{(x - \alpha)^2 + \beta^2\}}, \quad x \in \mathbb{R},$$

with respect to standard Lebesgue measure on $\mathbb{R}$. The representation space is
$(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ and the parameter space is $\Theta = \mathbb{R} \times \mathbb{R}_+$.


In the following we shall see other examples of statistical models, and other vari- 
ants of the models listed above. We shall then not always give all the formal details 
in terms of identifying representation space, family, and parameter space, as these 
mostly will be obvious in the given context. 


1.1.2 Reparametrization


In many cases it is convenient to work with several different parametrizations of
a statistical model when the associated family of probability measures can be
represented in more than one way, i.e.

$$\mathcal{P} = \{\nu_\theta \mid \theta \in \Theta\} = \{\xi_\lambda \mid \lambda \in \Lambda\}.$$


This can be for a number of different reasons; maybe one of the parametrizations is 
particularly natural for interpretation in the scientific context where the model is to be 
used, maybe the parameter has a simple description in terms of the properties of the 
model, or a particular parametrization may be more convenient for the mathematical 
manipulations used when analysing the model. 

We would normally assume that both parametrizations are bijective, implying that
there must be a bijective function $\phi$ such that $\lambda = \phi(\theta)$, and we shall then say that
$\phi$ is a reparametrization of the model. In this case, we have that

$$\xi_\lambda = \nu_{\phi^{-1}(\lambda)}, \quad \nu_\theta = \xi_{\phi(\theta)}.$$

It is important to know whether methods used for statistical inference are equivariant
in the sense that, based on data $x$, methods yield the same inference for $\nu_\theta$ and $\xi_\lambda$
if $\lambda$ and $\theta$ are related as above. We shall comment on that from time to time. Below
we shall give a list of examples of relevant reparametrizations of some of the models 
described in Section 1.1.1. 


Example 1.10. [Poisson model] In Example 1.2 we parametrized the Poisson family
through its mean

$$\lambda = E_\lambda(X),$$

but an alternative parameter could be the null fraction $\eta = P_\lambda(X = 0) = e^{-\lambda}$ with
the family of distributions represented as

$$P_\eta(X = x) = \eta \, \frac{(-\log \eta)^x}{x!}.$$

Here the function $\phi : \mathbb{R}_+ \to (0, 1)$ given as $\phi(\lambda) = e^{-\lambda}$ is bijective, so this
constitutes a valid reparametrization of the model with a new parameter space
$H = (0, 1)$.
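
As a small numerical check of this reparametrization, the sketch below evaluates the Poisson probabilities in both parametrizations and confirms that they agree when the null fraction is the exponential of minus the mean. The function names and the parameter value 2.3 are illustrative assumptions, not part of the text.

```python
import math

def pmf_mean(x, lam):
    """Poisson probability of x in the mean parametrization, as in (1.1)."""
    return lam ** x / math.factorial(x) * math.exp(-lam)

def pmf_null_fraction(x, eta):
    """The same probability expressed through eta = P(X = 0)."""
    return eta * (-math.log(eta)) ** x / math.factorial(x)

def phi(lam):
    """Reparametrization map from the mean to the null fraction."""
    return math.exp(-lam)

def phi_inverse(eta):
    """Inverse map from the null fraction back to the mean."""
    return -math.log(eta)

lam = 2.3                      # hypothetical parameter value
eta = phi(lam)
for x in range(6):
    assert math.isclose(pmf_mean(x, lam), pmf_null_fraction(x, eta))
assert math.isclose(phi_inverse(eta), lam)
```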

Example 1.11. [Exponential distribution] In Example 1.3 we parametrized the family
through its mean $\mu$:

$$E_\mu(X) = \int_0^\infty \frac{x e^{-x/\mu}}{\mu} \, dx = \mu,$$

but it is sometimes more practical to parametrize the family using the rate $\lambda = 1/\mu$
and then write the density as

$$g_\lambda(x) = \lambda e^{-\lambda x}, \quad x > 0.$$

Since the map $\phi : \mathbb{R}_+ \to \mathbb{R}_+$, given as $\phi(\mu) = 1/\mu$, is bijective, this constitutes a
valid reparametrization of the model with a new parameter space $\Lambda = \mathbb{R}_+$.


Example 1.12. [Gamma models] The gamma models have a large variety of interesting
parametrizations, and we shall see several of those in later chapters. In Example 1.6
we have used the shape $\alpha$ and scale $\beta$, but as for the exponential model
we would sometimes replace the scale with the rate $\lambda = 1/\beta$; as earlier, the map
$\phi : \mathbb{R}_+^2 \to \mathbb{R}_+^2$ defined as $\phi(\alpha, \beta) = (\alpha, 1/\beta)$ is bijective, so this yields an
alternative parametrization.

Yet another possibility is to use the mean $\mu = \alpha\beta$ and variance $\sigma^2 = \alpha\beta^2$
as parameters, and the reparametrization function is $\psi : \mathbb{R}_+^2 \to \mathbb{R}_+^2$ defined as
$\psi(\alpha, \beta) = (\alpha\beta, \alpha\beta^2)$, which again is bijective.
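
The bijectivity of the mean and variance parametrization can be made concrete by writing down the inverse map: solving the two equations for the shape and scale gives a scale equal to the variance divided by the mean and a shape equal to the squared mean divided by the variance. The sketch below, with function names and numerical values chosen only for illustration, checks the round trip.

```python
import math

def to_mean_variance(alpha, beta):
    """Map shape and scale to (mean, variance) = (alpha*beta, alpha*beta**2)."""
    return alpha * beta, alpha * beta ** 2

def to_shape_scale(mu, sigma2):
    """Inverse map: beta = sigma2 / mu and alpha = mu**2 / sigma2."""
    return mu ** 2 / sigma2, sigma2 / mu

alpha, beta = 2.5, 0.8                      # hypothetical shape and scale
mu, sigma2 = to_mean_variance(alpha, beta)
alpha_back, beta_back = to_shape_scale(mu, sigma2)

assert math.isclose(alpha_back, alpha) and math.isclose(beta_back, beta)
```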

Example 1.13. [Decreasing density] For a more subtle example, we consider the
non-parametric model in Example 1.7 with decreasing densities. Instead of using
the density as ‘parameter’, we could, for example, use the cumulative distribution
function

$$F(x) = P(X \leq x), \quad x \geq 0.$$

The parameter space would then be

$$\mathcal{F} = \{F : [0, \infty) \to [0, 1] \mid F \text{ is concave and right continuous}\}$$

as there would be a bijection $\phi$ between the set of (equivalence classes of) decreasing
densities and $\mathcal{F}$, since a density is equivalent to a decreasing function if and only if
the corresponding distribution function is concave.

A yet more subtle parametrization uses a classical result of A. Ya. Khinchin, saying
that a random variable $X$ with values in $\mathbb{R}_+$ has a concave distribution function
if and only if $X$ has the same distribution as $YU$, where $U$ and $Y$ are independent,
$U$ is uniformly distributed on $(0, 1)$, and $Y$ has an arbitrary distribution
on $[0, \infty)$. This relation defines a natural bijection between $\mathcal{F}$ and the set $\mathcal{G}$ of all
distribution functions on $\mathbb{R}_+$ via the representation for $G \in \mathcal{G}$:

$$F_G(x) = \int_0^\infty \frac{\min(x, y)}{y} \, G(dy),$$

where we have let $0/0 = 1$; hence also $\mathcal{G}$ may be used as a parameter space for this
model.


1.1.3 Parameter functions 


We consider a statistical model with representation space $(\mathcal{X}, \mathbb{E})$ and associated
family $\mathcal{P}$. A parameter function is formally a mapping of the form $\phi : \mathcal{P} \to \Lambda$ where $\Lambda$
is some set, often a subset of $\mathbb{R}^k$.

If $\nu : \Theta \to \mathcal{P}$ is a parametrization of $\mathcal{P}$, we may instead think of the parameter
function as a map $\phi' : \Theta \to \Lambda$ by composing it with $\nu$, i.e. $\phi' = \phi \circ \nu$. Similarly, if
$\nu$ is an injective parametrization of $\mathcal{P}$, any function $\phi : \Theta \to \Lambda$ can be considered
a parameter function by composing it with $\nu^{-1}$ as $\phi \circ \nu^{-1}$. In the following, we
shall (in the case of injective parametrizations) interchangeably regard a parameter
function as defined on $\Theta$ or $\mathcal{P}$ without further ado.

We might not necessarily be interested in the full parameter $\theta$, but sometimes
only specific aspects of it. This is in particular the case when $\Theta$ is high-dimensional,
where we for example could be interested in parameter functions such as


a) A specific coordinate of the parameter: $\phi(\theta) = \theta_i$ if $\Theta \subseteq \mathbb{R}^k$.

b) The mean of the distribution: $\phi(\theta) = E_\theta(X)$.

c) The variance of the distribution: $\phi(\theta) = V_\theta(X)$.

d) The median of the distribution: $P_\theta\{X \leq \phi(\theta)\} = 1/2$.

e) The coefficient of variation of the distribution:

$$\phi(\theta) = C_\theta\{X\} = \sqrt{V_\theta\{X\}} / E_\theta\{X\}$$


but there are many other possibilities. 
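
To make the list concrete, the following sketch evaluates these parameter functions in the exponential model of Example 1.3, where they have closed forms: the mean is the parameter itself, the variance is its square, the median is the parameter times log 2, and the coefficient of variation is constantly 1. The function name and the dictionary layout are assumptions made for this illustration only.

```python
import math

def exponential_parameter_functions(mu):
    """Evaluate several parameter functions for the exponential model with mean mu."""
    mean = mu
    variance = mu ** 2
    median = mu * math.log(2.0)          # solves 1 - exp(-m / mu) = 1/2
    cv = math.sqrt(variance) / mean      # coefficient of variation
    return {"mean": mean, "variance": variance, "median": median, "cv": cv}

values = exponential_parameter_functions(3.0)   # hypothetical value of mu
print(values)
```

The constant coefficient of variation illustrates that a parameter function need not be injective: here it carries no information about the mean at all.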


1.1.4 Nuisance parameters and parameters of interest


If the parameter space has the form $\Theta = \Lambda \times \Xi$ we may wish to focus on the component
$\phi(\theta) = \lambda$, whereas the component $\xi \in \Xi$ is just something we must take
into account. We say that $\lambda$ is a parameter of interest and $\xi$ a nuisance parameter,
reflecting that we consider $\xi$ as a disturbing factor.


1.2 Likelihood, score, and information 


In this section we shall exclusively consider a subclass of statistical models where 
the associated family of probability measures is dominated. Formally we define 


Definition 1.14. A family $\mathcal{P}$ of probability measures on $(\mathcal{X}, \mathbb{E})$ is said to be
$\mu$-dominated if $\mu$ is a $\sigma$-finite measure on $(\mathcal{X}, \mathbb{E})$ so that every $P \in \mathcal{P}$ has a density
with respect to $\mu$.


In other words, a family is dominated if for all $P \in \mathcal{P}$, there exists a non-negative
real-valued function $f$ such that it holds for all $A \in \mathbb{E}$ that

$$P(A) = \int_A f(x) \, \mu(dx).$$


All the numbered examples in Section 1.1.1 are dominated, and this will be the case
for almost all models considered in this book. An example of a non-dominated family
is the family $\mathcal{P}_2$ of probability measures on $(\mathcal{X}, \mathbb{E}) = (\mathbb{R}, \mathcal{B})$ with finite second
moments. This family includes measures with both discrete and continuous support and
therefore does not admit a dominating measure; in particular, it contains the point
mass at every $x \in \mathbb{R}$, whereas a $\sigma$-finite measure can assign positive mass to at most
countably many points. If a dominated family $\mathcal{P}$ is parametrized, we may write

$$P_\theta(A) = \int_A f_\theta(x) \, \mu(dx).$$

The important fact is that $\mu$ is the same for all values of $\theta$ in this expression. Although
each $f_\theta$ is only defined up to a $\mu$-null set, we shall in the following assume that we
once and for all have chosen a specific family $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$ of densities so
we without ambiguity may identify the parameter space $\Theta$, the family of probability
measures $\mathcal{P}$, and the family of densities $\mathcal{F}$.


1.2.1 The likelihood function 


The notion of likelihood function was introduced by R. A. Fisher, and despite its 
simplicity, it is arguably the most fundamental concept in statistical theory. 


1.2.1.1 Formal definition 


We consider an injectively parametrized statistical model on $(\mathcal{X}, \mathbb{E})$ with a
$\mu$-dominated family and associated family of densities $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$.

Definition 1.15. For every $x \in \mathcal{X}$, the likelihood function $L_x : \Theta \to [0, \infty)$ is the
function given as

$$L_x(\theta) = f_\theta(x)$$

and the log-likelihood function $\ell_x : \Theta \to [-\infty, \infty)$ is the logarithm of the likelihood
function:

$$\ell_x(\theta) = \ell(x, \theta) = \log L_x(\theta).$$

In other words, if we consider $f_\theta(x)$ as a function of two variables, a density
appears if we consider $\theta$ fixed and $x$ varying, whereas the likelihood appears by
considering $x$ fixed and $\theta$ varying. Given how fundamental the idea of likelihood is,
its rationale is surprisingly simple; if

$$L_x(\theta_1) > L_x(\theta_2)$$

or, equivalently,

$$f_{\theta_1}(x) > f_{\theta_2}(x),$$

then it is more likely that the data $x$ have been generated by $P_{\theta_1}$ than by $P_{\theta_2}$. In other
words, high values of $L_x(\theta)$ or $\ell_x(\theta)$ mean that data $x$ support $P_\theta$ as the generating
distribution.

It is important to realize that the likelihood function is only well-defined up to a 
multiplicative constant. When we defined the likelihood function above, we chose a 
base measure $\mu$ arbitrarily, and the absolute value of the likelihood function depends
highly on that choice.

Suppose that we choose another dominating measure $\tilde{\mu} = g \cdot \mu$. Then the two
likelihood functions $L$ and $\tilde{L}$ relative to $\mu$ and $\tilde{\mu}$ will satisfy the relation

$$L_x(\theta) = g(x) \tilde{L}_x(\theta),$$

because the density of $P_\theta$ with respect to $\tilde{\mu}$ will be $f_\theta / g$, where $f_\theta$ is the density
of $P_\theta$ with respect to $\mu$. Since the data $x$ are fixed, the two likelihood functions are
proportional but not identical. Indeed, one may show that any two versions of the 
likelihood function are always proportional in the following sense: 


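Theorem 1.16. Let $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ be a parametrized statistical model with
representation space $(\mathcal{X}, \mathbb{E})$ and let $\mu$ and $\tilde{\mu}$ be dominating measures for $\mathcal{P}$. Then
the densities and associated likelihood functions $L$ and $\tilde{L}$ may be chosen such that
there is a measurable function $h : \mathcal{X} \to \mathbb{R}_+$ satisfying

$$L_x(\theta) = h(x) \tilde{L}_x(\theta), \quad \text{for all } \theta \in \Theta \text{ and all } x \in \mathcal{X}.$$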


Proof. The proof is somewhat technical and omitted. The interested reader may find 
further details in Hansen (2012, p. 77 ff). 


In other words, the likelihood function is only well-defined up to a positive multiplicative
constant (depending on $x$ only) and only ratios of likelihood functions
make solid sense since it then holds that

$$\frac{L_x(\theta_1)}{L_x(\theta_2)} = \frac{h(x) \tilde{L}_x(\theta_1)}{h(x) \tilde{L}_x(\theta_2)} = \frac{\tilde{L}_x(\theta_1)}{\tilde{L}_x(\theta_2)}.$$


Similarly, only differences of log-likelihood functions are meaningful. Therefore, it 
is common to ignore such constant terms when making likelihood calculations. 

[Figure: two panels titled ‘Likelihood’ and ‘Log-likelihood’; the vertical axis of the log-likelihood panel shows $\ell - \ell(3)$.]

Figure 1.1: The Poisson likelihood and log-likelihood functions for five observations
(3, 3, 6, 1, 6), normalized with the value of the function at $\lambda = 3$.


Example 1.17. [Poisson likelihood] Consider a simple Poisson model and an observation
$X = x$. The likelihood and log-likelihood functions become

$$L_x(\lambda) = \lambda^x e^{-\lambda}, \quad \ell_x(\lambda) = x \log \lambda - \lambda.$$

We have ignored the factor $1/x!$ which enters as a multiplicative constant in $L_x$ and
as an additive constant $-\log x!$ in $\ell_x$. If we are considering the variant of the Poisson
model with $X_1, \ldots, X_n$ being independent and identically distributed, we get from
observations $x_1, \ldots, x_n$ that

$$\ell_x(\lambda) = \log \lambda \sum_{i=1}^n x_i - n\lambda,$$

where we again have ignored terms that do not depend on $\lambda$. Figure 1.1 displays
the likelihood and log-likelihood functions for a sample of five observations
$(x_1, \ldots, x_5) = (3, 3, 6, 1, 6)$, i.e. with $n = 5$, normalized with the value of the
likelihood at $\lambda = 3$.
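
The calculation in this example can be reproduced in a few lines. The sketch below is illustrative only: the function names and the evaluation grid are assumptions made here. It computes the log-likelihood for the five observations with and without the constant term, normalizes at the value 3 as in Figure 1.1, and locates the peak on a grid. It also checks that dropping the constant does not change differences of log-likelihood values.

```python
import math

data = [3, 3, 6, 1, 6]   # the five observations used in Figure 1.1

def log_likelihood(lam, xs):
    """Poisson log-likelihood with the additive constant -sum(log x_i!) dropped."""
    return math.log(lam) * sum(xs) - len(xs) * lam

def full_log_likelihood(lam, xs):
    """Poisson log-likelihood including the constant; lgamma(x + 1) equals log x!."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

def normalized(lam, xs, ref=3.0):
    """Log-likelihood minus its value at the reference point."""
    return log_likelihood(lam, xs) - log_likelihood(ref, xs)

# Differences of log-likelihoods do not depend on the dropped constant.
diff_short = log_likelihood(2.0, data) - log_likelihood(5.0, data)
diff_full = full_log_likelihood(2.0, data) - full_log_likelihood(5.0, data)
assert math.isclose(diff_short, diff_full)

# Locate the peak on a grid from 0.5 to 10 in steps of 0.01.
grid = [0.5 + 0.01 * i for i in range(951)]
best = max(grid, key=lambda lam: log_likelihood(lam, data))
print(best, normalized(best, data))   # the peak lies near the sample mean 3.8
```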

The likelihood and log-likelihood in the Poisson example are well behaved and 
this is mostly the case in situations within the scope of this book. However, for com- 
pleteness, we shall also consider an example where the likelihood is not so well 
behaved. 


Example 1.18. [Uniform likelihood] Consider a sample $X = (X_1, \ldots, X_n)$ from
the family of uniform distributions on the interval $(0, \theta)$ for $\theta \in \Theta = \mathbb{R}_+$, as in
Example 1.8. We get for the likelihood function

$$L_x(\theta) = f_\theta(x) = \prod_{i=1}^n \frac{1}{\theta} 1_{(0, \theta)}(x_i) = \theta^{-n} 1_{(y_n, \infty)}(\theta),$$

where $y_n = \max(x_1, \ldots, x_n)$ is the largest observation. The likelihood function is
plotted in Figure 1.2 for 10 simulated observations from a uniform distribution on the 
interval (0, 2.5). The likelihood function is sharply peaked at the largest observation, 
which necessarily must be smaller than the true value. 
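
The discontinuous shape of this likelihood is easy to see in code. In the sketch below, the function name and the small sample are hypothetical choices for illustration, not the simulated data behind Figure 1.2. The function is zero up to and including the largest observation and decays as the parameter raised to the power minus n beyond it.

```python
def uniform_likelihood(theta, xs):
    """Likelihood theta**(-n) for theta above the largest observation, else 0."""
    n, largest = len(xs), max(xs)
    return theta ** (-n) if theta > largest else 0.0

sample = [0.4, 2.1, 1.3]   # hypothetical positive observations
for theta in (1.0, 2.1, 2.2, 3.0, 4.0):
    print(theta, uniform_likelihood(theta, sample))
```

Because the indicator refers to the open interval above the largest observation, the supremum of the likelihood is approached but not attained as the parameter decreases towards that observation.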

[Figure: the likelihood plotted against $\theta$ for $\theta$ between 0 and 4.]

Figure 1.2: The likelihood function in the uniform model for 10 simulated observations.
The true value $\theta = 2.5$ of the parameter is indicated by a vertical line.


1.2.1.2 Equivariance of the likelihood function


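Consider a bijective reparametrization $\phi : \Theta \to \Lambda$ with $\Lambda = \phi(\Theta)$. The family of
densities is then represented in the new parametrization as

$$\{g_\lambda \mid \lambda \in \Lambda\} = \{f_\theta \mid \theta \in \Theta\}$$

so the log-likelihood function $\tilde{\ell}_x(\lambda) = \log g_\lambda(x)$ in the new parametrization for
$\lambda = \phi(\theta)$ will be

$$\tilde{\ell}_x(\lambda) = \ell_x(\phi^{-1}(\lambda)) = \ell_x(\theta),$$

where we have let $\ell$ denote the log-likelihood function in the $\theta$-parametrization. In other
words, we would have

$$\tilde{\ell}_x(\lambda_1) > \tilde{\ell}_x(\lambda_2) \iff \ell_x(\theta_1) > \ell_x(\theta_2) \quad \text{whenever } \phi(\theta_i) = \lambda_i, \ i = 1, 2.$$

So low or high values of $\tilde{\ell}_x(\lambda)$ match low or high values of $\ell_x(\theta)$ if $\lambda = \phi(\theta)$,
reflecting that the likelihood function is equivariant under bijective reparametrizations.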
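
Equivariance can be illustrated with the exponential model, parametrized by its mean in Example 1.3 and by its rate in Example 1.11. The sketch below uses hypothetical data and function names chosen for this illustration. It checks that the two log-likelihoods take the same value at corresponding parameter values, so that rankings of parameter values are preserved.

```python
import math

def loglik_mean(mu, xs):
    """Exponential log-likelihood in the mean parametrization."""
    return sum(-x / mu - math.log(mu) for x in xs)

def loglik_rate(lam, xs):
    """Exponential log-likelihood in the rate parametrization, lam = 1 / mu."""
    return sum(math.log(lam) - lam * x for x in xs)

xs = [0.7, 1.9, 0.2, 3.1]   # hypothetical positive observations

for mu in (0.5, 1.0, 2.5):
    # Corresponding parameter values give identical log-likelihood values.
    assert math.isclose(loglik_rate(1.0 / mu, xs), loglik_mean(mu, xs))
```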


1.2.1.3 Likelihood as a random variable 


Although the likelihood appears by considering the density for a fixed $x$ as a function
of the unknown parameter $\theta$, different likelihood functions appear for different
outcomes $x$ of the random variable $X$ associated with the model under investigation.

Thus the likelihood function itself, and its logarithm, can be considered as random
functions $L_X(\cdot)$ and $\ell_X(\cdot)$, and we shall do so without paying specific attention
to the $\sigma$-algebra making the corresponding map measurable; when a specific choice
of the family $\mathcal{F}$ of densities has been made, the map and notion are defined without
ambiguity.

[Figure: two panels, one for $n = 5$ and one for $n = 100$; vertical axis $\ell - \ell(3)$.]

Figure 1.3: Five realizations of the Poisson log-likelihood normalized with the value
of the function at the true value $\lambda = 3$ for $n = 5$ and $n = 100$ observations. The true
value is indicated by a vertical line.


An example of the variability of these random likelihood functions for the simple 
Poisson model is displayed in Figure 1.3. We note that the variability of the likelihood 
function is much larger in the diagram to the left, corresponding to $n = 5$, than in the
diagram to the right, where $n = 100$. We also note that the log-likelihood function is
more peaked for $n = 100$ and the peak occurs close to the true value of the parameter.
In addition, the shape of the log-likelihood function is not far from a parabola when
$n = 100$. We shall later, in Chapter 5, see that this is not a coincidence but rather
a typical behaviour of the log-likelihood function. These facts suggest that the value
of $\theta$ at which the log-likelihood peaks could be a good guess for the true value of
that parameter. Indeed this shall later, in Section 4.3, be formalized into the idea
of maximum likelihood estimation. 
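
The behaviour described here can be explored by simulation. For the Poisson log-likelihood of Example 1.17, setting the derivative to zero shows that the peak is located at the sample mean (provided at least one observation is positive), so it suffices to simulate sample means. The sketch below is a rough illustration under stated assumptions: the sampler uses Knuth's multiplication method, and the function names, seed, and number of replications are arbitrary choices made here.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def peak_locations(n, lam, replications, rng):
    """Sample means of repeated samples, i.e. the peaks of the log-likelihoods."""
    return [sum(sample_poisson(lam, rng) for _ in range(n)) / n
            for _ in range(replications)]

def spread(values):
    """Standard deviation of a list of numbers."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

rng = random.Random(2022)                  # arbitrary fixed seed
small = peak_locations(5, 3.0, 2000, rng)
large = peak_locations(100, 3.0, 2000, rng)

# The theoretical standard deviations are sqrt(3/5) and sqrt(3/100).
print(spread(small), spread(large))
```

The peak locations scatter much less around the true value 3 for the larger sample size, in line with the pattern seen in Figure 1.3.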


1.2.2 Score and information 


Consider now a $\mu$-dominated family of probability measures on $(\mathcal{X}, \mathbb{E})$ with associated
family of densities $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$ where $\Theta \subseteq \mathbb{R}^k$ is open and recall that a
function $g : \mathbb{R}^k \to \mathbb{R}^m$ is said to be smooth if $g$ is infinitely often differentiable, i.e.
if $g \in C^\infty$. If $\Theta$ is an open subset of $\mathbb{R}^k$ and the likelihood function is smooth for
$\mu$-almost all $x$, we say that the family is smooth.


Remark 1.19. We emphasize that almost all results in this book would hold if
‘smooth’ is interpreted as twice continuously differentiable ($C^2$) rather than infinitely
often differentiable ($C^\infty$). But for simplicity we restrict to $C^\infty$-functions to
avoid having to keep track of the degree of differentiability.


We now introduce two fundamental quantities. 


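Definition 1.20. The score function for a smooth family is

$$S(x,\theta) = D\ell_x(\theta) = \left(\frac{\partial \ell_x(\theta)}{\partial\theta_1}, \ldots, \frac{\partial \ell_x(\theta)}{\partial\theta_k}\right)$$

and the information function is

$$I_x(\theta) = I(x,\theta) = -D^2\ell_x(\theta) = -\begin{pmatrix}
\frac{\partial^2 \ell_x(\theta)}{\partial\theta_1^2} & \frac{\partial^2 \ell_x(\theta)}{\partial\theta_1\partial\theta_2} & \cdots & \frac{\partial^2 \ell_x(\theta)}{\partial\theta_1\partial\theta_k} \\
\frac{\partial^2 \ell_x(\theta)}{\partial\theta_2\partial\theta_1} & \frac{\partial^2 \ell_x(\theta)}{\partial\theta_2^2} & \cdots & \frac{\partial^2 \ell_x(\theta)}{\partial\theta_2\partial\theta_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 \ell_x(\theta)}{\partial\theta_k\partial\theta_1} & \frac{\partial^2 \ell_x(\theta)}{\partial\theta_k\partial\theta_2} & \cdots & \frac{\partial^2 \ell_x(\theta)}{\partial\theta_k^2}
\end{pmatrix}.$$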

In other words, the score function is the derivative and the information function is 
the negative Hessian of the log-likelihood function. The arbitrary constant associated 
with the log-likelihood function disappears in the differentiation process. 

The idea behind the score function is that it measures how sensitive the
log-likelihood function is to changes in the parameter $\theta$, the logic being that if it does
not change much with the parameter $\theta$, there is little information about $\theta$. The formal
measure of information is then obtained by a second differentiation. Also, the score
function is zero at any local maximum of the log-likelihood function, so a zero score
at $\theta'$, say, might indicate that the true value is close to $\theta'$. As we shall show below
in Theorem 1.23, the expectation of the information function is equal to the variance
of the score function under mild regularity conditions. Also, since the log-likelihood
function $\ell_x$ often approximately has the shape of a parabola (see Figure 1.3), the first
and second derivatives of $\ell_x$ roughly identify the shape and location of the
log-likelihood function.


Example 1.21. [Normal score and information] It is illuminating to understand the
score and information function by considering a normal model with known variance
$\sigma^2$, i.e. the family of densities

$$f_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x-\theta)^2}{2\sigma^2}\right\}, \quad \theta \in \mathbb{R},$$

with log-likelihood function

$$\ell_x(\theta) = -\frac{(x-\theta)^2}{2\sigma^2},$$

where we have ignored additive constants that do not depend on $\theta$. This yields the
score function

$$S(x,\theta) = \ell_x'(\theta) = \frac{x-\theta}{\sigma^2},$$

and the information function is obtained by changing sign and differentiating once
more

$$I(x,\theta) = -\ell_x''(\theta) = \frac{1}{\sigma^2} = \mathbf{V}_\theta\{S(X,\theta)\} = \mathbf{E}_\theta\{I(X,\theta)\}.$$

Thus the information is simply the inverse variance, or concentration, supporting the
intuition that the information function captures the accuracy available to determine $\theta$.
We shall later, in Chapter 4 and Chapter 5, see that this interpretation can be extended
to more general situations: the Fisher information yields the inherent precision with
which a parameter may be determined from data.
The identities in the last line are known as Bartlett’s identities and are established for
general models under suitable smoothness and regularity conditions in Theorem 1.23
below.
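
As a quick sanity check of Example 1.21, the closed-form score $(x-\theta)/\sigma^2$ and information $1/\sigma^2$ can be compared with finite-difference derivatives of the log-likelihood $-(x-\theta)^2/(2\sigma^2)$. The sketch below is illustrative only; the function names and the step size `h` are assumptions, not part of the text.

```python
def log_lik(theta, x, sigma2):
    """Normal log-likelihood with known variance, additive constants dropped."""
    return -(x - theta) ** 2 / (2.0 * sigma2)


def score(theta, x, sigma2):
    """Closed-form score of Example 1.21."""
    return (x - theta) / sigma2


def information(sigma2):
    """Closed-form information function; it depends on neither x nor theta."""
    return 1.0 / sigma2


def numeric_score(theta, x, sigma2, h=1e-3):
    """Central first difference of the log-likelihood in theta."""
    return (log_lik(theta + h, x, sigma2) - log_lik(theta - h, x, sigma2)) / (2.0 * h)


def numeric_information(theta, x, sigma2, h=1e-3):
    """Negative central second difference of the log-likelihood in theta."""
    second = (log_lik(theta + h, x, sigma2) - 2.0 * log_lik(theta, x, sigma2)
              + log_lik(theta - h, x, sigma2)) / h ** 2
    return -second


print(score(0.5, 1.2, 2.0), numeric_score(0.5, 1.2, 2.0))
print(information(2.0), numeric_information(0.5, 1.2, 2.0))
```

Because the log-likelihood is exactly quadratic in $\theta$, the finite differences agree with the closed forms up to rounding error.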


We consider a $\mu$-dominated parametrized family $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ of probability
measures on $(\mathcal{X}, \mathbb{E})$. We define

Definition 1.22. A smooth family $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$ of densities with respect to
$\mu$ is said to be stable if every $\theta \in \Theta$ has an open neighbourhood $U_\theta$ and there are
$\mu$-integrable functions $g_\theta$ and $h_\theta$ such that for all $x \in \mathcal{X}$, $i, j = 1, \ldots, k$, and all
$\eta \in U_\theta$:

$$\left|\frac{\partial}{\partial\eta_i} L_x(\eta)\right| \le g_\theta(x), \qquad \left|\frac{\partial^2}{\partial\eta_i\partial\eta_j} L_x(\eta)\right| \le h_\theta(x). \tag{1.4}$$

In words, the derivative $DL_x(\theta)$ and Hessian $D^2L_x(\theta)$ of the likelihood function
are uniformly locally bounded in a neighbourhood of $\theta$ by integrable functions.

The condition (1.4) ensures that differentiation with respect to $\theta$ of integrals of
these functions commutes with integration, so differentiation can be performed inside
the integration sign; see for example Schilling (2017, Theorem 12.5).

Just as we may consider the log-likelihood function $\ell_X(\theta)$ as a random object, we
may also consider the score $S(X, \theta)$ and information $I(X, \theta)$ as random objects. We have
the following important identities for their simple moments:


Theorem 1.23 (Bartlett’s identities). Consider a statistical model on $(\mathcal{X}, \mathbb{E})$ with a
$\mu$-dominated family $\{P_\theta \mid \theta \in \Theta\}$ where the associated family of densities is smooth
and stable. It then holds that

$$\mathbf{E}_\theta\{S(X,\theta)\} = 0, \qquad \mathbf{V}_\theta\{S(X,\theta)^\top\} = \mathbf{E}_\theta\{S(X,\theta)^\top S(X,\theta)\} = \mathbf{E}_\theta\{I(X,\theta)\}.$$


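Proof. Since the family is smooth and stable, we may differentiate under the integral
sign. As

$$\int f_\theta(x)\,\mu(dx) = 1$$

we get by differentiation

$$\begin{aligned}
0 &= \frac{\partial}{\partial\theta_i}\left(\int f_\theta(x)\,\mu(dx)\right) = \int \frac{\partial}{\partial\theta_i} f_\theta(x)\,\mu(dx) \\
  &= \int \frac{\frac{\partial}{\partial\theta_i} f_\theta(x)}{f_\theta(x)}\, f_\theta(x)\,\mu(dx) = \int \frac{\partial}{\partial\theta_i}\ell_x(\theta)\, f_\theta(x)\,\mu(dx) = \mathbf{E}_\theta\{S(X,\theta)_i\}
\end{aligned}$$

and the first identity is established.

Differentiating a second time we find

$$0 = \frac{\partial}{\partial\theta_j}\left(\int \frac{\partial}{\partial\theta_i} f_\theta(x)\,\mu(dx)\right) = \int \frac{\partial^2}{\partial\theta_i\partial\theta_j} f_\theta(x)\,\mu(dx)$$

so also

$$\mathbf{E}_\theta\left\{\frac{\frac{\partial^2}{\partial\theta_i\partial\theta_j} f_\theta(X)}{f_\theta(X)}\right\} = \int \frac{\partial^2}{\partial\theta_i\partial\theta_j} f_\theta(x)\,\mu(dx) = 0$$

and since

$$\begin{aligned}
\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell_x(\theta)
  &= \frac{\frac{\partial^2}{\partial\theta_i\partial\theta_j} f_\theta(x)}{f_\theta(x)} - \frac{\frac{\partial}{\partial\theta_i} f_\theta(x)}{f_\theta(x)}\,\frac{\frac{\partial}{\partial\theta_j} f_\theta(x)}{f_\theta(x)} \\
  &= \frac{\frac{\partial^2}{\partial\theta_i\partial\theta_j} f_\theta(x)}{f_\theta(x)} - S(x,\theta)_i\, S(x,\theta)_j,
\end{aligned}$$

we further obtain, using that $\mathbf{E}_\theta\{S(X,\theta)\} = 0$,

$$\begin{aligned}
\mathbf{E}_\theta\left\{-\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell_X(\theta)\right\}
  &= -\mathbf{E}_\theta\left\{\frac{\frac{\partial^2}{\partial\theta_i\partial\theta_j} f_\theta(X)}{f_\theta(X)}\right\} + \mathbf{V}_\theta\{S(X,\theta)\}_{ij} \\
  &= \mathbf{V}_\theta\{S(X,\theta)\}_{ij}
\end{aligned}$$

as desired.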


For a smooth and stable family, the quantity

$$i(\theta) = \mathbf{V}_\theta\{S(X,\theta)\} = \mathbf{E}_\theta\{I(X,\theta)\}$$


is known as the Fisher information (matrix) or expected information and it plays a 
central role in the following developments and in many other areas of mathematical 
statistics. 

We need another important quantity associated with smooth statistical families 
and define: 


Definition 1.24. [Quadratic score] Consider a statistical model on $(\mathcal{X}, \mathbb{E})$ with a
$\mu$-dominated family $\{P_\theta \mid \theta \in \Theta\}$ where the associated family of densities is smooth
and stable and the Fisher information matrix is positive definite. The quadratic score
is the quantity

$$Q(X,\theta) = S(X,\theta)\, i(\theta)^{-1} S(X,\theta)^\top,$$

where $i(\theta)$ is the Fisher information and $S(X,\theta)$ the score statistic.


Thus the quadratic score measures the length of the score statistic with respect
to the inner product determined by its inverse variance, i.e. the inverse Fisher
information. We then have


Theorem 1.25. Consider a statistical model on $(\mathcal{X}, \mathbb{E})$ with a $\mu$-dominated family
$\{P_\theta \mid \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}^k$ an open set. If the associated family of densities is smooth
and stable and $i(\theta)$ is positive definite, it holds that

$$\mathbf{E}_\theta\{Q(X,\theta)\} = k.$$


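Proof. We have

$$\begin{aligned}
\mathbf{E}_\theta\{Q(X,\theta)\} &= \mathbf{E}_\theta\{S(X,\theta)\, i(\theta)^{-1} S(X,\theta)^\top\}
   = \mathbf{E}_\theta\big[\operatorname{tr}\{S(X,\theta)\, i(\theta)^{-1} S(X,\theta)^\top\}\big] \\
  &= \mathbf{E}_\theta\big[\operatorname{tr}\{i(\theta)^{-1} S(X,\theta)^\top S(X,\theta)\}\big]
   = \operatorname{tr}\big[i(\theta)^{-1}\, \mathbf{E}_\theta\{S(X,\theta)^\top S(X,\theta)\}\big] \\
  &= \operatorname{tr}\{i(\theta)^{-1} i(\theta)\} = k,
\end{aligned}$$

as desired. We used Bartlett’s identities from Theorem 1.23 for the penultimate equation
and properties of the trace of a matrix in several of the steps above; see Section A.5
for the latter.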


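Example 1.26. [Poisson score and information] In the simple Poisson model, we
have previously (see Example 1.17) calculated the log-likelihood function to be

$$\ell_x(\lambda) = x\log\lambda - \lambda,$$

which is clearly smooth, so the family of densities for the Poisson model is smooth.
By differentiation we further get

$$S(x,\lambda) = \frac{x}{\lambda} - 1, \qquad I(x,\lambda) = \frac{x}{\lambda^2}.$$

We shall in the next section argue that the family is also stable. This implies that
Bartlett’s identities hold, which in this case can also be verified directly:

$$\mathbf{E}_\lambda\{S(X,\lambda)\} = \frac{\mathbf{E}_\lambda(X)}{\lambda} - 1 = \frac{\lambda}{\lambda} - 1 = 0,$$

and we may calculate the Fisher information in two different ways:

$$i(\lambda) = \mathbf{V}_\lambda\{S(X,\lambda)\} = \frac{\mathbf{V}_\lambda(X)}{\lambda^2} = \frac{1}{\lambda}; \qquad
  i(\lambda) = \mathbf{E}_\lambda\{I(X,\lambda)\} = \frac{\mathbf{E}_\lambda(X)}{\lambda^2} = \frac{1}{\lambda},$$

where we have used that a Poisson distribution has $\mathbf{E}_\lambda(X) = \mathbf{V}_\lambda(X) = \lambda$, so the
identities are satisfied. The quadratic score becomes

$$Q(x,\lambda) = \left(\frac{x}{\lambda} - 1\right)^2 \lambda = \frac{(x-\lambda)^2}{\lambda}$$

and we may easily verify that $\mathbf{E}_\lambda\{Q(X,\lambda)\} = \mathbf{V}_\lambda(X)/\lambda = 1$, as Theorem 1.25
says.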
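
The identities in Example 1.26 can also be checked numerically by summing against the Poisson probabilities. This is a minimal sketch under stated assumptions: the helper names are illustrative and the infinite sum is truncated at `upper = 200`, which is harmless for moderate $\lambda$.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability of x, computed on the log scale for stability."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def expect(func, lam, upper=200):
    """E_lam{func(X)} for X ~ Poisson(lam), with the sum truncated at `upper`."""
    return sum(func(x) * poisson_pmf(x, lam) for x in range(upper))


def score(x, lam):
    return x / lam - 1.0


def information(x, lam):
    return x / lam ** 2


def quadratic_score(x, lam):
    return (x - lam) ** 2 / lam


lam = 3.0
mean_score = expect(lambda x: score(x, lam), lam)           # should be close to 0
var_score = expect(lambda x: score(x, lam) ** 2, lam)       # should be close to 1/lam
mean_info = expect(lambda x: information(x, lam), lam)      # should be close to 1/lam
mean_q = expect(lambda x: quadratic_score(x, lam), lam)     # should be close to k = 1
print(mean_score, var_score, mean_info, mean_q)
```

Since the score has mean zero, its second moment equals its variance, so `var_score` and `mean_info` are two computations of the Fisher information $i(\lambda) = 1/\lambda$.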


Since the quadratic score has expectation equal to the model dimension and the
score itself has mean zero at the true parameter, large values of the quadratic score at
a parameter value will point to that parameter value not being true. In other words, the
quadratic score $Q(x, \theta)$ behaves in a similar way to the negative of the log-likelihood
function $-\ell(x, \theta)$ and indeed may be seen as a quadratic approximation to $-\ell(x, \theta)$.


Example 1.27. [Score and information in the gamma distribution] For a
two-dimensional example, consider the gamma family in Example 1.6 with densities

$$f_{\alpha,\beta}(x) = \frac{x^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)\beta^{\alpha}}, \quad x > 0, \quad (\alpha,\beta) \in \mathbb{R}_+^2.$$

The log-likelihood function becomes

$$\ell_x(\alpha,\beta) = (\alpha - 1)\log x - \alpha\log\beta - \log\Gamma(\alpha) - \frac{x}{\beta}.$$

Also this family is smooth and stable, as we shall see in Chapter 3, so Bartlett’s
identities hold. The score function is obtained by differentiation

$$S(x,\alpha,\beta) = \left(\log x - \log\beta - \Psi(\alpha),\; -\frac{\alpha}{\beta} + \frac{x}{\beta^2}\right),$$

where $\Psi(\alpha) = D\log\Gamma(\alpha)$ is known as the digamma function.

As $\mathbf{E}_{(\alpha,\beta)}(X) = \alpha\beta$ we see directly that the second component has mean zero.
From the fact that also the first component must have mean zero, we deduce that

$$\mathbf{E}_{(\alpha,\beta)}(\log X) = \log\beta + \Psi(\alpha).$$

Changing signs and differentiating a second time yields the information function

$$I(x,\alpha,\beta) = \begin{pmatrix} \Psi_1(\alpha) & \dfrac{1}{\beta} \\[2mm] \dfrac{1}{\beta} & \dfrac{2x - \alpha\beta}{\beta^3} \end{pmatrix}.$$

Here $\Psi_1(\alpha) = \Psi'(\alpha) = D^2\log\Gamma(\alpha)$ is known as the trigamma function. Taking
expectations yields the Fisher information matrix

$$i(\alpha,\beta) = \begin{pmatrix} \Psi_1(\alpha) & \dfrac{1}{\beta} \\[2mm] \dfrac{1}{\beta} & \dfrac{\alpha}{\beta^2} \end{pmatrix}.$$

Note that from Bartlett’s identities, we have that

$$i(\alpha,\beta) = \mathbf{E}_{\alpha,\beta}\{I(X,\alpha,\beta)\} = \mathbf{V}_{\alpha,\beta}\{S(X,\alpha,\beta)\} = \mathbf{V}_{\alpha,\beta}\{(\log X, X/\beta^2)\},$$

which conforms with the fact that

$$\mathbf{V}_{\alpha,\beta}(X/\beta^2) = \alpha\beta^2/\beta^4 = \alpha/\beta^2.$$

As we shall see later, there are also other ways of deriving these results.
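
The mean-zero property of the gamma score, and hence the identity $\mathbf{E}(\log X) = \log\beta + \Psi(\alpha)$, can be probed by Monte Carlo. The sketch below is an illustration under assumptions: the digamma function is approximated by a central difference of `math.lgamma` (the standard library has no digamma), and the sample size and seed are arbitrary. Python's `random.gammavariate(alpha, beta)` uses the same shape and scale parametrization as the density in Example 1.27.

```python
import math
import random


def digamma(a, h=1e-5):
    """Approximate Psi(a) = D log Gamma(a) by a central difference of lgamma."""
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2.0 * h)


def score(x, alpha, beta, psi):
    """Score of the gamma family; psi is the precomputed value Psi(alpha)."""
    return (math.log(x) - math.log(beta) - psi,
            -alpha / beta + x / beta ** 2)


def mean_score(alpha, beta, n, seed):
    """Monte Carlo average of the score over n draws from Gamma(alpha, scale beta)."""
    rng = random.Random(seed)
    psi = digamma(alpha)
    total_a = total_b = 0.0
    for _ in range(n):
        s_a, s_b = score(rng.gammavariate(alpha, beta), alpha, beta, psi)
        total_a += s_a
        total_b += s_b
    return total_a / n, total_b / n


print(mean_score(alpha=2.5, beta=1.5, n=100_000, seed=0))
```

Both averages should be close to zero, with Monte Carlo error of order $n^{-1/2}$.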


Example 1.28. [Uniform not smooth] The uniform family is not smooth, as the
likelihood function is not differentiable at $\theta = \max(x_1, \ldots, x_n)$, see Figure 1.2. So
score and information are not well-defined for this type of model.


1.2.3 Reparametrization and repetition 


It is important to understand how likelihood, score, and information behave when 
we reparametrize a model. We first establish that the properties of smoothness and 
stability are preserved by diffeomorphisms, i.e. reparametrizations that are bijective 
and smooth in both directions. 

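Lemma 1.29. Consider a family of densities $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}^k$ open,
and a smooth and bijective map $\phi : \Theta \to \Lambda$ with $\det(D\phi(\theta)) \ne 0$ for all $\theta \in \Theta$. Then $\phi$ is a
diffeomorphism and $\mathcal{F}$ is smooth and stable if and only if the reparametrized family
$\tilde{\mathcal{F}} = \{g_\lambda \mid \lambda \in \Lambda\}$ is smooth and stable.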
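
Proof. The inverse function theorem (Theorem A.20) ensures that $\phi$ is a
diffeomorphism. The statements concerning smoothness follow as this property is closed
under composition of functions. We let $L_x(\theta)$ and $\tilde{L}_x(\lambda)$ denote the likelihood
functions expressed in the two parametrizations, so that $L_x(\theta) = \tilde{L}_x(\phi(\theta))$. Using
composite differentiation (Rudin, 1976, Theorem 9.15) yields

$$\frac{\partial L_x(\theta)}{\partial\theta_i} = \sum_u \frac{\partial \tilde{L}_x(\lambda)}{\partial\lambda_u}\,\frac{\partial\phi_u(\theta)}{\partial\theta_i}$$

or, in matrix form,

$$DL_x(\theta) = D\tilde{L}_x(\phi(\theta))\, D\phi(\theta),$$

where $D\phi(\theta)$ is the Jacobian of $\phi$. So the left-hand side is locally bounded by
integrable functions if $D\tilde{L}_x(\phi(\theta))$ is.
Differentiating a second time yields

$$\frac{\partial^2 L_x(\theta)}{\partial\theta_i\partial\theta_j}
  = \sum_{u,v} \frac{\partial^2 \tilde{L}_x(\lambda)}{\partial\lambda_u\partial\lambda_v}\,\frac{\partial\phi_u(\theta)}{\partial\theta_i}\,\frac{\partial\phi_v(\theta)}{\partial\theta_j}
  + \sum_u \frac{\partial \tilde{L}_x(\lambda)}{\partial\lambda_u}\,\frac{\partial^2\phi_u(\theta)}{\partial\theta_i\partial\theta_j},$$

which again is locally bounded by integrable functions if the derivatives on the
right-hand side are. To see the converse we just reverse the roles of $\theta$ and $\lambda$, which can be
done since the inverse function theorem ensures that $\phi^{-1}$ also satisfies the conditions
of the lemma.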


Lemma 1.29 ensures that for many purposes, we do not have to worry about
which parametrization we use for a given model, as long as reparametrizations are
diffeomorphisms. However, we must track how the score and information change
when parameters change, and this is summarized in the theorem below.

Theorem 1.30. Consider a family of densities $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$ that is smooth and
stable. Let $\phi : \Theta \to \Lambda$ be a diffeomorphism as in Lemma 1.29. Further, let $S$, $I$,
$i$, and $Q$ denote the score, information function, Fisher information, and quadratic
score with respect to $\lambda$; and $\tilde{S}$, $\tilde{I}$, $\tilde{i}$, and $\tilde{Q}$ the corresponding quantities with respect
to $\theta$. It then holds with $\lambda = \phi(\theta)$ that

$$\begin{aligned}
\tilde{S}(x,\theta) &= S(x,\lambda)\, D\phi(\theta) && (1.5) \\
\tilde{I}(x,\theta) &= D\phi(\theta)^\top I(x,\lambda)\, D\phi(\theta) - \sum_u S(x,\lambda)_u\, D^2\phi_u(\theta) && (1.6) \\
\tilde{i}(\theta) &= D\phi(\theta)^\top i(\phi(\theta))\, D\phi(\theta) = D\phi(\theta)^\top i(\lambda)\, D\phi(\theta), && (1.7) \\
\tilde{Q}(x,\theta) &= Q(x,\lambda), && (1.8)
\end{aligned}$$

where $D\phi$ is the Jacobian of $\phi$ and $D^2\phi_u$ is the Hessian of $\phi_u$.


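Proof. Lemma 1.29 yields that all quantities are well defined in the reparametrized
model. Composite differentiation applied to the score function yields

$$\tilde{S}(x,\theta) = D\tilde{\ell}_x(\theta) = D\ell_x(\phi(\theta))\, D\phi(\theta)
  = \left(\sum_u \frac{\partial \ell_x(\lambda)}{\partial\lambda_u}\,\frac{\partial\phi_u(\theta)}{\partial\theta_i}\right)_{i=1,\ldots,k}
  = S(x,\lambda)\, D\phi(\theta),$$

which establishes (1.5). For the information function, we get by further
differentiation

$$\tilde{I}(x,\theta)_{ij} = -\frac{\partial^2 \tilde{\ell}_x(\theta)}{\partial\theta_i\partial\theta_j}
  = -\sum_{u,v} \frac{\partial^2 \ell_x(\lambda)}{\partial\lambda_u\partial\lambda_v}\,\frac{\partial\phi_u(\theta)}{\partial\theta_i}\,\frac{\partial\phi_v(\theta)}{\partial\theta_j}
    - \sum_u \frac{\partial \ell_x(\lambda)}{\partial\lambda_u}\,\frac{\partial^2\phi_u(\theta)}{\partial\theta_i\partial\theta_j},$$

which in a more compact form is (1.6). Then (1.7) can be derived from (1.5) as

$$\begin{aligned}
\tilde{i}(\theta) &= \mathbf{V}_\theta\{\tilde{S}(X,\theta)\} = \mathbf{E}_\theta\{\tilde{S}(X,\theta)^\top \tilde{S}(X,\theta)\} \\
  &= D\phi(\theta)^\top\, \mathbf{E}_\lambda\{S(X,\lambda)^\top S(X,\lambda)\}\, D\phi(\theta) = D\phi(\theta)^\top i(\lambda)\, D\phi(\theta)
\end{aligned}$$

or alternatively from (1.6) since

$$\mathbf{E}_\lambda\Big\{\sum_u S(X,\lambda)_u\, D^2\phi_u(\theta)\Big\} = \sum_u \mathbf{E}_\lambda\{S(X,\lambda)_u\}\, D^2\phi_u(\theta) = 0$$

and thus

$$\tilde{i}(\theta) = \mathbf{E}_\theta\{\tilde{I}(X,\theta)\} = \mathbf{E}_\lambda\{D\phi(\theta)^\top I(X,\lambda)\, D\phi(\theta)\} = D\phi(\theta)^\top i(\lambda)\, D\phi(\theta).$$

Finally, we get

$$\begin{aligned}
\tilde{Q}(x,\theta) &= \tilde{S}(x,\theta)\, \tilde{i}(\theta)^{-1} \tilde{S}(x,\theta)^\top \\
  &= S(x,\lambda)\, D\phi(\theta) \left(D\phi(\theta)^\top i(\lambda)\, D\phi(\theta)\right)^{-1} D\phi(\theta)^\top S(x,\lambda)^\top \\
  &= S(x,\lambda)\, i(\lambda)^{-1} S(x,\lambda)^\top = Q(x,\lambda).
\end{aligned}$$

This completes the proof.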


Note in particular that the quadratic score is equivariant as was also true for the 
likelihood and log-likelihood function. 


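Example 1.31. [Score for reparametrized Poisson] Consider again the simple Poisson
family, but let us parametrize the family with $\theta = \log\lambda$ instead of the mean $\lambda$. In
Example 1.26, we calculated the score, information function, and Fisher information
for $\lambda$ to be

$$S(x,\lambda) = \frac{x}{\lambda} - 1, \qquad I(x,\lambda) = \frac{x}{\lambda^2}, \qquad i(\lambda) = \frac{1}{\lambda}.$$

To use the identities in Theorem 1.30, we note that $\lambda = \phi(\theta) = e^\theta$ and $D\phi(\theta) = D^2\phi(\theta) = e^\theta$, so we get

$$\tilde{S}(x,\theta) = S(x,\lambda)\, e^\theta = x - e^\theta, \qquad
  \tilde{I}(x,\theta) = e^{2\theta}\frac{x}{\lambda^2} - S(x,\lambda)\, e^\theta = x - x + e^\theta = e^\theta$$

and

$$\tilde{Q}(x,\theta) = \frac{(x - e^\theta)^2}{e^\theta} = \frac{(x-\lambda)^2}{\lambda} = Q(x,\lambda).$$

Clearly we get the same results by calculating the quantities directly in the
$\theta$-parametrization. Then the Poisson density becomes

$$f_\theta(x) = \frac{e^{\theta x}}{x!}\, e^{-e^\theta},$$

so we get by differentiation (ignoring $x!$ as usual)

$$\tilde{\ell}_x(\theta) = x\theta - e^\theta; \qquad \tilde{S}(x,\theta) = x - e^\theta; \qquad \tilde{I}(x,\theta) = e^\theta = \tilde{i}(\theta).$$

The quantities are often most easily found by direct calculation as we did here at the
end, but Theorem 1.30 is important for theoretical considerations.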
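
The agreement between the two routes in Example 1.31 can be checked numerically. The sketch below applies the one-dimensional versions of (1.5), (1.6) and (1.7) with $\phi(\theta) = e^\theta$ and compares them with the direct formulas; the function names are illustrative assumptions.

```python
import math


def lam_quantities(x, lam):
    """Score, information function and Fisher information in the mean parametrization."""
    return x / lam - 1.0, x / lam ** 2, 1.0 / lam


def theta_quantities_via_theorem(x, theta):
    """Apply (1.5)-(1.7) of Theorem 1.30 with lambda = phi(theta) = exp(theta)."""
    lam = math.exp(theta)
    d_phi = math.exp(theta)       # D phi(theta)
    d2_phi = math.exp(theta)      # D^2 phi(theta)
    s, info, fisher = lam_quantities(x, lam)
    s_t = s * d_phi                                 # (1.5)
    info_t = d_phi * info * d_phi - s * d2_phi      # (1.6)
    fisher_t = d_phi * fisher * d_phi               # (1.7)
    return s_t, info_t, fisher_t


def theta_quantities_direct(x, theta):
    """Direct differentiation of x * theta - exp(theta)."""
    return x - math.exp(theta), math.exp(theta), math.exp(theta)


def quadratic_score(s, fisher):
    """One-dimensional quadratic score s * fisher^{-1} * s."""
    return s * s / fisher


x, theta = 4, 0.7
print(theta_quantities_via_theorem(x, theta))
print(theta_quantities_direct(x, theta))
```

The quadratic scores computed in the two parametrizations also coincide, as (1.8) states.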


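Example 1.32. [Exponential distribution: rate or mean] The model for the simple
exponential distribution may either be parametrized by the rate or the mean.
Parametrizing with the rate yields

$$f_\lambda(x) = \lambda e^{-\lambda x}, \quad x > 0,$$

with associated log-likelihood, score, information, and quadratic score being

$$\ell_x(\lambda) = \log\lambda - \lambda x, \qquad S(x,\lambda) = \frac{1}{\lambda} - x, \qquad
  I(x,\lambda) = \frac{1}{\lambda^2} = i(\lambda), \qquad Q(x,\lambda) = (\lambda x - 1)^2.$$

The mean of an exponential distribution with rate $\lambda$ is $\mathbf{E}_\lambda(X) = 1/\lambda$, conforming
with the property that $\mathbf{E}_\lambda\{S(X,\lambda)\} = 0$ and $\mathbf{E}_\lambda\{Q(X,\lambda)\} = 1$. Now parametrizing
with the mean $\theta = 1/\lambda$ yields

$$f_\theta(x) = \frac{1}{\theta}\, e^{-x/\theta}, \quad x > 0,$$

with associated log-likelihood, score, and information being

$$\tilde{\ell}_x(\theta) = -\log\theta - \frac{x}{\theta}, \qquad \tilde{S}(x,\theta) = \frac{x-\theta}{\theta^2}, \qquad
  \tilde{I}(x,\theta) = \frac{2x - \theta}{\theta^3},$$

yielding the Fisher information

$$\tilde{i}(\theta) = \mathbf{E}_\theta\{\tilde{I}(X,\theta)\} = \frac{2\theta - \theta}{\theta^3} = \frac{1}{\theta^2}$$

and quadratic score

$$\tilde{Q}(x,\theta) = \theta^2\left(\frac{x-\theta}{\theta^2}\right)^2 = \left(\frac{x}{\theta} - 1\right)^2 = (\lambda x - 1)^2 = Q(x,\lambda).$$

We note that this conforms well with Theorem 1.30 since with $\phi(\theta) = 1/\theta$ we have
$D\phi(\theta) = -1/\theta^2$, $D^2\phi(\theta) = 2/\theta^3$ and thus, for example, using (1.6) we get

$$\tilde{I}(x,\theta) = \frac{1}{\theta^4}\,\frac{1}{\lambda^2} - \left(\frac{1}{\lambda} - x\right)\frac{2}{\theta^3}
  = \frac{1}{\theta^2} - \frac{2(\theta - x)}{\theta^3} = \frac{2x - \theta}{\theta^3},$$

as we also found above. Similarly (1.7) yields

$$\tilde{i}(\theta) = \frac{i(\lambda)}{\theta^4} = \frac{1}{\lambda^2\theta^4} = \frac{1}{\theta^2}.$$

Note that also in this case it seems simpler to calculate the quantities directly than
using the formulae in Theorem 1.30.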
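
For a one-dimensional parameter, the transformation rules (1.5), (1.6) and (1.7) reduce to scalar arithmetic, so Example 1.32 can be verified in a few lines. This is an illustrative sketch; `reparametrize` and the other names are assumptions introduced here, not notation from the text.

```python
def reparametrize(score, info, fisher, d_phi, d2_phi):
    """One-dimensional versions of (1.5), (1.6) and (1.7) in Theorem 1.30."""
    return (score * d_phi,
            d_phi * info * d_phi - score * d2_phi,
            d_phi * fisher * d_phi)


def rate_quantities(x, lam):
    """Score, information function and Fisher information in the rate parametrization."""
    return 1.0 / lam - x, 1.0 / lam ** 2, 1.0 / lam ** 2


def mean_quantities(x, theta):
    """The same three quantities computed directly in the mean parametrization."""
    return (x - theta) / theta ** 2, (2.0 * x - theta) / theta ** 3, 1.0 / theta ** 2


x, theta = 1.7, 0.8
via_theorem = reparametrize(*rate_quantities(x, 1.0 / theta),
                            d_phi=-1.0 / theta ** 2,    # D phi for phi(theta) = 1/theta
                            d2_phi=2.0 / theta ** 3)    # D^2 phi
direct = mean_quantities(x, theta)
print(via_theorem)
print(direct)
```

The two tuples agree up to rounding, which mirrors the algebra carried out by hand in the example.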


In the special case where the reparametrization is affine, the relations involving
score and information simplify.

Corollary 1.33. Consider a family of densities $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$ that is smooth
and stable. Let $\phi : \Theta \to \Lambda$ be an affine and bijective reparametrization, i.e. $\lambda =
\phi(\theta) = a + B\theta$. Let further $S$, $I$, and $i$ denote the score, information function, and
Fisher information with respect to $\lambda$; and $\tilde{S}$, $\tilde{I}$, and $\tilde{i}$ the corresponding quantities
with respect to $\theta$. Then it holds that

$$\begin{aligned}
\tilde{S}(x,\theta) &= S(x,\lambda)\, B && (1.9) \\
\tilde{I}(x,\theta) &= B^\top I(x,\lambda)\, B && (1.10) \\
\tilde{i}(\theta) &= B^\top i(\phi(\theta))\, B = B^\top i(\lambda)\, B. && (1.11)
\end{aligned}$$


Proof. This follows immediately from Theorem 1.30 since $D\phi(\theta) = B$ and
$D^2\phi_u(\theta) = 0$ for all $u$.


It is also important to understand the behaviour of score and information when
we have repeated observations. So consider a sample $X = (X_1, \ldots, X_n)$ of
independent and identically distributed random variables with distributions from a
parametrized family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$.

Theorem 1.34. Assume that $\mathcal{P}$ is smooth and stable. Then so is $\mathcal{P}^{\otimes n}$. The score
and information functions $S_n$, $I_n$, and $i_n$ for $\mathcal{P}^{\otimes n}$ are related to the corresponding
functions $S$, $I$, and $i$ for $\mathcal{P}$ as

$$S_n(X,\theta) = \sum_{i=1}^n S(X_i,\theta), \qquad I_n(X,\theta) = \sum_{i=1}^n I(X_i,\theta), \qquad i_n(\theta) = n\, i(\theta)$$

and the quadratic score becomes

$$Q_n(X,\theta) = S_n(X,\theta)\, i_n(\theta)^{-1} S_n(X,\theta)^\top = n\, \bar{S}_n(X,\theta)\, i(\theta)^{-1} \bar{S}_n(X,\theta)^\top \tag{1.12}$$

where $\bar{S}_n(X,\theta) = S_n(X,\theta)/n$ is the average score.


Proof. This is immediate from the definition of the concepts involved.


As a consequence we have the following:


Corollary 1.35. Assume that $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ is smooth and stable with $\Theta \subseteq \mathbb{R}^k$
and assume the Fisher information $i(\theta)$ is positive definite. Then the quadratic score
$Q_n(X, \theta)$ associated with $\mathcal{P}^{\otimes n}$ for $n$ repeated samples converges in distribution as

$$Q_n(X,\theta) \xrightarrow{\;\mathcal{D}\;} \chi^2(k) \quad \text{for } n \to \infty,$$

where $\chi^2(k)$ denotes the $\chi^2$-distribution with $k$ degrees of freedom.

Proof. From the Central Limit Theorem (Theorem A.16), it follows that

$$\sqrt{n}\, \bar{S}_n(X,\theta) \xrightarrow{\;\mathcal{D}\;} \mathcal{N}_k(0, i(\theta)) \quad \text{for } n \to \infty.$$

The result now follows from Theorem A.11 and (1.12).
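
Corollary 1.35 can be illustrated by simulation in the simple Poisson model, where $k = 1$ and (1.12) reduces to $Q_n = n(\bar{x} - \lambda)^2/\lambda$. The sketch below is an illustration under assumptions: the sampler (Knuth's method), sample sizes and seed are arbitrary choices, and 3.841 is the approximate 0.95-quantile of the $\chi^2(1)$-distribution.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate by Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def quadratic_score_n(xs, lam):
    """Q_n = n * (mean(xs) - lam)^2 / lam for the simple Poisson model."""
    n = len(xs)
    return n * (sum(xs) / n - lam) ** 2 / lam


def simulate(lam, n, reps, seed):
    """Simulate `reps` independent values of Q_n at the true parameter."""
    rng = random.Random(seed)
    return [quadratic_score_n([poisson_draw(lam, rng) for _ in range(n)], lam)
            for _ in range(reps)]


values = simulate(lam=3.0, n=100, reps=2000, seed=7)
mean_q = sum(values) / len(values)                      # close to k = 1
tail = sum(v > 3.841 for v in values) / len(values)     # close to 0.05
print(mean_q, tail)
```

The simulated mean should be near 1 and roughly 5% of the values should exceed 3.841, up to Monte Carlo error and the discreteness of the Poisson distribution.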


1.3. Exercises 


Exercise 1.1. Consider the simple normal model in Example 1.5 with the modification
that the mean $\xi$ is restricted to $\xi \in \mathbb{R}_+$ and reparametrize the model in terms of
the mean and coefficient of variation

$$C(X) = \sqrt{\mathbf{V}(X)}/\mathbf{E}(X).$$


Exercise 1.2. A drawing pin is thrown several times at random until it has landed 
five times flat on its head, and the total number of throws needed to achieve this is 
recorded. Describe a suitable statistical model for this experiment, performed with 
the purpose of determining the probability that the pin lands on its head. Hint: use 
the negative binomial distribution. 


Exercise 1.3. If $Y = \log X$ and $Y \sim \mathcal{N}(\xi, \sigma^2)$, $X$ is said to have a log-normal
distribution. Identify the mean and coefficient of variation in this distribution and
parametrize the family with these quantities.


Exercise 1.4. Consider the simple normal model in Example 1.5 and determine the
score, information function, Fisher information, and quadratic score in the
parametrization using the mean $\xi$ and standard deviation $\sigma$.


Exercise 1.5. Consider the simple Poisson model, parametrized with the null fraction 
as in Example 1.10 and determine the score, information function, Fisher informa- 
tion, and quadratic score in this parametrization. 


Exercise 1.6. Consider the simple normal model as in Exercise 1.1 and determine 
the score, information function, Fisher information, and quadratic score when para- 
metrized with mean and coefficient of variation. 


Exercise 1.7. Consider the family of log-normal distributions as in Exercise 1.3 and 
determine the score, information function, Fisher information, and quadratic score 
when parametrized with the mean and coefficient of variation. 


Exercise 1.8. Consider the family of negative binomial distributions as in Exer- 
cise 1.2 and determine the score, information function, Fisher information, and quad- 
ratic score in a suitable parametrization. 


Exercise 1.9. Consider the family of distributions with densities

$$f_\theta(x) = a(\theta)\, x^{\theta-1}\, 1_{(0,1)}(x), \quad \theta \in \Theta = \mathbb{R}_+,$$

with respect to standard Lebesgue measure on $\mathbb{R}$, where $a(\theta)$ is a normalizing
constant.


a) Show that $a(\theta) = \theta$;


b) Show that $Y = -\log X$, where $X$ has the density above, follows an exponential
distribution with mean $\mathbf{E}_\theta(Y) = 1/\theta$;

c) Find the score, information function, Fisher information, and quadratic score; 


d) Verify Bartlett’s identities directly by calculating relevant quantities and showing 
that they are equal. 


Exercise 1.10. Consider the Cauchy model in Example 1.9 for the case when the
scale parameter $\beta = 1$ is known.


a) Determine the score and information function; 


b) Investigate by simulation whether Bartlett’s identities hold in this case. 




Chapter 2 


Linear Normal Models 


Linear normal models are the basis of a huge body of statistical methods and we shall 
briefly describe the basic elements of these. In particular we give a short treatment of 
the normal distribution on a Euclidean vector space, but abstain from a full treatment 
of the statistical issues of linear models, as this would require and deserve a text on 
its own. 

The most important results for the developments in the following chapters are 
the decomposition theorems: Theorem 2.24 and the extended form in Theorem 2.27. 
The proof of these theorems can be quite cumbersome in some expositions. Here a 
geometric interpretation and proof is given which hopefully is both simple and illu- 
minating. To achieve this properly, it is most convenient to discuss the normal distri- 
bution as a distribution on a finite-dimensional vector space with an inner product, 
in other words a Euclidean space. This will be done in Section 2.2. Before we move 
into this generality, we first recall some standard results for the multivariate normal
distribution on $\mathbb{R}^d$ in the next section.


2.1 The multivariate normal distribution 


Recall from, e.g. Jacod and Protter (2004, Ch. 16) that $X = (X_1, \ldots, X_d)^\top$ follows
a multivariate normal distribution on $\mathbb{R}^d$ with mean $\xi \in \mathbb{R}^d$ and covariance matrix
$\Sigma \in \mathbb{R}^{d \times d}$ if it holds for any $\lambda = (\lambda_1, \ldots, \lambda_d)^\top \in \mathbb{R}^d$ that the corresponding linear
combination is univariate normal with mean $\lambda^\top \xi$ and variance $\lambda^\top \Sigma \lambda$:

$$\lambda^\top X \sim \mathcal{N}(\lambda^\top \xi, \lambda^\top \Sigma \lambda).$$

Here and elsewhere we let $\mathcal{N}(a, 0) = \delta_a$ denote Dirac measure at $a$, i.e. the
distribution with all mass at $a$:

$$\delta_a(B) = 1_B(a), \quad \text{for all } B \in \mathcal{B}(\mathbb{R}).$$

Then $\Sigma = \{\sigma_{ij}\}$ is a positive semidefinite (hence symmetric) matrix, $\mathbf{E}(X_i) = \xi_i$,
and $\mathbf{V}(X_i, X_j) = \sigma_{ij}$. We then write $X \sim \mathcal{N}_d(\xi, \Sigma)$.

We first show that the distribution is well-defined and recall that any distribution
of a random variable $X$ with values in $\mathbb{R}^d$ is uniquely determined by its characteristic
function (Jacod and Protter, 2004, Theorem 14.1)

$$\varphi_X(u) = \mathbf{E}(e^{iu^\top X}).$$


DOI: 10.1201/9781003272359-2


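Proposition 2.1. Let $\Sigma$ be any positive semidefinite $d \times d$ matrix and $\xi \in \mathbb{R}^d$. Then
the normal distribution $\mathcal{N}_d(\xi, \Sigma)$ exists and has characteristic function

$$\varphi_X(u) = \mathbf{E}(e^{iu^\top X}) = \exp\{iu^\top \xi - u^\top \Sigma u/2\}. \tag{2.1}$$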


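Proof. Let $Z_1, \ldots, Z_d$ be independent and identically normally distributed as $\mathcal{N}(0, 1)$
and let $Z = (Z_1, \ldots, Z_d)^\top$. The characteristic function $\varphi_Z$ of $Z$ is then

$$\varphi_Z(u) = \prod_{j=1}^d e^{-u_j^2/2} = \exp\{-\|u\|^2/2\}.$$

Since $\Sigma$ is positive semidefinite, the spectral theorem (Horn and Johnson, 2013,
Corollary 2.5.11) implies that there is an orthogonal matrix $U$ and a diagonal matrix
$\Lambda$ such that $\Sigma = U\Lambda U^\top$. The diagonal elements of $\Lambda$ are the non-negative
eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d$ of $\Sigma$; some of these may be identical, and some may be zero if $\Sigma$
is only positive semidefinite. We then let $\sqrt{\Lambda}$ be the diagonal matrix with $\sqrt{\lambda_i}$ in the
diagonal, let $A = U\sqrt{\Lambda}U^\top$, and $X = \xi + AZ$. Note that since $U$ is orthogonal we
have $U^\top U = I$, and thus

$$AA^\top = U\sqrt{\Lambda}U^\top U\sqrt{\Lambda}U^\top = U\Lambda U^\top = \Sigma.$$

It now holds that any linear combination of the coordinates of $X$ corresponds to an
affine combination of elements of $Z$ as

$$\lambda^\top X = \lambda^\top \xi + (\lambda^\top A) Z$$

and thus

$$\lambda^\top X \sim \mathcal{N}(\lambda^\top \xi, \|A^\top \lambda\|^2) = \mathcal{N}(\lambda^\top \xi, \lambda^\top AA^\top \lambda) = \mathcal{N}(\lambda^\top \xi, \lambda^\top \Sigma \lambda),$$

so $X \sim \mathcal{N}_d(\xi, \Sigma)$, as desired. We get for the characteristic function

$$\begin{aligned}
\varphi_X(u) &= \mathbf{E}(e^{iu^\top X}) = \mathbf{E}(e^{iu^\top(\xi + AZ)}) = e^{iu^\top \xi}\, \mathbf{E}(e^{iu^\top AZ}) \\
  &= e^{iu^\top \xi}\, \varphi_Z(A^\top u) = e^{iu^\top \xi} \exp\{-\|A^\top u\|^2/2\} \\
  &= e^{iu^\top \xi} \exp\{-u^\top AA^\top u/2\} = \exp\{iu^\top \xi - u^\top \Sigma u/2\}.
\end{aligned}$$

The result follows.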


If $\Sigma$ is positive definite, $\Sigma$ is also invertible, and the inverse $K = \Sigma^{-1}$ is the
concentration matrix of the normal distribution. We then say that the normal distribution
is regular; then the distribution of $X$ has a density with respect to standard Lebesgue
measure on $\mathbb{R}^d$, i.e. the Lebesgue measure giving measure 1 to the unit cube

$$\{u \in \mathbb{R}^d \mid 0 < u_i < 1,\ i = 1, \ldots, d\}.$$

Proposition 2.2. Let $\xi \in \mathbb{R}^d$, $\Sigma \in \mathbb{R}^{d \times d}$ be a positive definite matrix, and $K = \Sigma^{-1}$.
Then the normal distribution $\mathcal{N}_d(\xi, \Sigma)$ has density

$$\begin{aligned}
f_{(\xi,\Sigma)}(x) &= (2\pi)^{-d/2} (\det \Sigma)^{-1/2} \exp\{-(x - \xi)^\top \Sigma^{-1} (x - \xi)/2\} \\
  &= (2\pi)^{-d/2} (\det K)^{1/2} \exp\{-(x - \xi)^\top K (x - \xi)/2\}
\end{aligned}$$

with respect to standard Lebesgue measure on $\mathbb{R}^d$.


Proof. See Corollary 16.2 of Jacod and Protter (2004) although their formula (16.5)
has a typographical error.


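Example 2.3. Consider the case $d = 2$ and let

$$\xi = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad \Sigma_\rho = \begin{pmatrix} 1 & 2\rho \\ 2\rho & 4 \end{pmatrix}.$$

If $-1 < \rho < 1$, this is a regular normal distribution since $\det \Sigma_\rho = 4(1 - \rho^2) > 0$.
The concentration matrix becomes

$$K_\rho = \Sigma_\rho^{-1} = \frac{1}{4(1 - \rho^2)} \begin{pmatrix} 4 & -2\rho \\ -2\rho & 1 \end{pmatrix}$$

and the distribution therefore has density

$$\begin{aligned}
f_{(\xi,\Sigma_\rho)}(x) &= \frac{1}{4\pi\sqrt{1 - \rho^2}} \exp\{-Q_\rho(x - \xi)/2\} \qquad\qquad (2.2) \\
  &= \frac{1}{4\pi\sqrt{1 - \rho^2}} \exp\left\{-\frac{(x_1 - 1)^2}{2(1 - \rho^2)} + \frac{\rho(x_1 - 1)(x_2 - 2)}{2(1 - \rho^2)} - \frac{(x_2 - 2)^2}{8(1 - \rho^2)}\right\}
\end{aligned}$$

with respect to Lebesgue measure on $\mathbb{R}^2$, where $Q_\rho(y) = y^\top K_\rho\, y$ is the quadratic form
determined by the concentration matrix. Here $\rho$ is the correlation between $X_1$ and
$X_2$. The contours of the density are ellipses centred at $\xi = (1, 2)^\top$, with major axes
and rotation depending on the correlation, see Figure 2.1.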
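
The two expressions for the density in Example 2.3 can be compared numerically. The following sketch evaluates the density once through the concentration matrix $K_\rho$ and once through the expanded exponent; the function names are illustrative assumptions.

```python
import math

XI = (1.0, 2.0)  # mean vector of Example 2.3


def density_via_concentration(x, rho):
    """Density computed from the concentration matrix K_rho = Sigma_rho^{-1}."""
    det = 4.0 * (1.0 - rho ** 2)                       # det Sigma_rho
    k11, k12, k22 = 4.0 / det, -2.0 * rho / det, 1.0 / det
    a, b = x[0] - XI[0], x[1] - XI[1]
    quad = k11 * a * a + 2.0 * k12 * a * b + k22 * b * b
    return math.exp(-quad / 2.0) / (2.0 * math.pi * math.sqrt(det))


def density_explicit(x, rho):
    """Density computed from the expanded exponent."""
    a, b = x[0] - XI[0], x[1] - XI[1]
    c = 1.0 - rho ** 2
    exponent = -a * a / (2.0 * c) + rho * a * b / (2.0 * c) - b * b / (8.0 * c)
    return math.exp(exponent) / (4.0 * math.pi * math.sqrt(c))


point = (0.3, 2.9)
for rho in (-0.75, 0.0, 0.75):
    print(rho, density_via_concentration(point, rho), density_explicit(point, rho))
```

Both functions return the same value for every $\rho$ with $|\rho| < 1$; at $\rho = \pm 1$ the determinant vanishes and no density exists.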


Affine images of normal distributions are again normal. More precisely 


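Proposition 2.4. Let $X \sim \mathcal{N}_d(\xi, \Sigma)$, $a \in \mathbb{R}^m$, and let $B \in \mathbb{R}^{m \times d}$ be an $m \times d$
matrix. If $Y = a + BX$ then $Y \sim \mathcal{N}_m(a + B\xi, B\Sigma B^\top)$.


Proof. Clearly $Y$ takes values in $\mathbb{R}^m$. The characteristic function of $Y$ is

$$\begin{aligned}
\varphi_Y(u) &= \mathbf{E}(e^{iu^\top Y}) = \mathbf{E}(e^{iu^\top(a + BX)}) = e^{iu^\top a}\, \mathbf{E}(e^{iu^\top BX}) \\
  &= e^{iu^\top a}\, \varphi_X(B^\top u) = e^{iu^\top a} \exp\{iu^\top B\xi - u^\top B\Sigma B^\top u/2\} \\
  &= \exp\{iu^\top \mu - u^\top \Psi u/2\},
\end{aligned}$$

where $\mu = a + B\xi$ and $\Psi = B\Sigma B^\top$; the result follows.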
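
Proposition 2.4 translates directly into a small computation of the mean and covariance of an affine image. The sketch below uses plain nested lists rather than a matrix library; the helper names and the numerical example (the covariance of Example 2.3 with $\rho = 3/4$) are illustrative assumptions.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def affine_image(a, b, xi, sigma):
    """Mean a + B xi and covariance B Sigma B^T of Y = a + B X for X ~ N(xi, Sigma)."""
    b_xi = [sum(b[i][j] * xi[j] for j in range(len(xi))) for i in range(len(b))]
    mean = [a[i] + b_xi[i] for i in range(len(a))]
    cov = matmul(matmul(b, sigma), transpose(b))
    return mean, cov


xi = [1.0, 2.0]
sigma = [[1.0, 1.5], [1.5, 4.0]]          # Sigma_rho with rho = 3/4

# Distribution of the sum X_1 + X_2: a = 0 and B = (1 1).
print(affine_image([0.0], [[1.0, 1.0]], xi, sigma))

# Marginal distribution of X_1: a = 0 and B = (1 0).
print(affine_image([0.0], [[1.0, 0.0]], xi, sigma))
```

Taking $B$ to be a coordinate projection recovers the marginal mean and variance of the selected component.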

Figure 2.1: Contour ellipses for the density of normal distributions with mean equal
to $\xi = (1, 2)^\top$, variances $\mathbf{V}(X_1) = 1$ and $\mathbf{V}(X_2) = 4$, and varying correlations
$\rho = -3/4, 0, 3/4$. Each ellipse indicates when the quadratic form $Q_\rho$ in the exponent
of (2.2) takes the value 1. When $\rho = -1$, the distribution degenerates to be supported
on the affine subspace $x_2 = -2x_1 + 4$, as indicated by the thick line in the diagram; see
Example 2.8 for details.


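Example 2.5. As a special case of this, assume the random vector $X$ partitioned into
components $X_1$ and $X_2$, where $X_1 \in \mathbb{R}^r$ and $X_2 \in \mathbb{R}^s$ with $r + s = d$. Its mean
vector and covariance matrix can then be partitioned accordingly into blocks as

$$\xi = \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

such that $\Sigma_{11}$ has dimensions $r \times r$ and so on. Let $X$ be distributed as $\mathcal{N}_d(\xi, \Sigma)$,
where $X$, $\xi$ and $\Sigma$ are partitioned as above. Then the marginal distribution of $X_1$ is
also normal:

$$X_1 \sim \mathcal{N}_r(\xi_1, \Sigma_{11}).$$

This follows by letting $a = 0$ and $B = (I_r : 0_{r \times s})$ in Proposition 2.4.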


Independence of components in the multivariate normal distribution may easily 
be checked via the covariance matrix. More precisely, we have: 


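Proposition 2.6. Assume the random vector $X \sim \mathcal{N}_d(\xi, \Sigma)$ is partitioned into components $X_1$
and $X_2$, where $X_1 \in \mathbb{R}^r$ and $X_2 \in \mathbb{R}^s$ with $r + s = d$, and its mean vector and
covariance matrix partitioned accordingly as in Example 2.5. Then $X_1$ and $X_2$ are
independent if and only if $\Sigma_{12} = 0$.


Proof. We get for the characteristic function

$$\begin{aligned}
\varphi_X(u) &= \exp\{iu^\top \xi - u^\top \Sigma u/2\} \\
  &= e^{iu_1^\top \xi_1 + iu_2^\top \xi_2} \exp\{-u_1^\top \Sigma_{11} u_1/2 - u_2^\top \Sigma_{22} u_2/2 - u_1^\top \Sigma_{12} u_2\} \\
  &= \varphi_{X_1}(u_1)\, \varphi_{X_2}(u_2) \exp\{-u_1^\top \Sigma_{12} u_2\}
\end{aligned}$$

and hence $\varphi_X(u) = \varphi_{X_1}(u_1)\varphi_{X_2}(u_2)$ for all $u$ if and only if $\Sigma_{12} = 0$, whence the result
follows.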


Combining Proposition 2.1 and Proposition 2.4, we obtain a structure theorem 
for multivariate normal distributions. 


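Theorem 2.7. Let $X$ be a random vector with values in $\mathbb{R}^d$, $\xi \in \mathbb{R}^d$, and let $\Sigma \in
\mathbb{R}^{d \times d}$ be positive semidefinite. Then $X \sim \mathcal{N}_d(\xi, \Sigma)$ if and only if there is a matrix
$A \in \mathbb{R}^{d \times d}$ with $AA^\top = \Sigma$ and $X \overset{\mathcal{D}}{=} \xi + AZ$ where $Z \sim \mathcal{N}_d(0, I_d)$. Here $I_d$ is the
$d \times d$ identity matrix.


Proof. If $AA^\top = \Sigma$ and $Y = \xi + AZ$, Proposition 2.4 yields that

$$Y \sim \mathcal{N}_d(\xi, AA^\top) = \mathcal{N}_d(\xi, \Sigma).$$

The converse is Proposition 2.1.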


The distribution of $\xi + AZ$ is supported on the affine subspace $\xi + \mathrm{rg}(A)$ where
$\mathrm{rg}(A) = A(\mathbb{R}^d)$ is the range of $A$, i.e. the image of the linear map with matrix $A$. If
$\Sigma$ is regular, $\mathrm{rg}(A) = \mathbb{R}^d$ and hence $\xi + \mathrm{rg}(A) = \xi + \mathbb{R}^d = \mathbb{R}^d$. If $\Sigma$ is only
semidefinite, we say that the normal distribution is singular and it does not have a density
with respect to Lebesgue measure on $\mathbb{R}^d$, but it is rather supported on a proper affine
subspace of $\mathbb{R}^d$.


Example 2.8. Consider the family of bivariate normal distributions in Example 2.3
but assume this time that $\rho = -1$ so the covariance matrix is

$$\Sigma_{-1} = \begin{pmatrix} 1 & -2 \\ -2 & 4 \end{pmatrix}.$$

This is singular since $\det \Sigma_{-1} = 0$. Indeed $\Sigma_{-1}$ has eigenvalues 5 and 0 with the
vector $u = (1, -2)^\top$ an eigenvector for the eigenvalue 5 and $v = (2, 1)^\top$ for the
eigenvalue 0. Thus we have $\Sigma_{-1} = AA^\top$, where

$$A = \frac{1}{\sqrt{5}} \begin{pmatrix} 1 & -2 \\ -2 & 4 \end{pmatrix}$$

and thus

$$X \overset{\mathcal{D}}{=} \begin{pmatrix} 1 \\ 2 \end{pmatrix} + \frac{1}{\sqrt{5}} \begin{pmatrix} 1 & -2 \\ -2 & 4 \end{pmatrix} \begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix},$$

where $Z = (Z_1, Z_2)^\top \sim \mathcal{N}_2(0, I_2)$. We conclude that this distribution is supported
on the affine subspace of points satisfying the relation $x_2 - 2 = -2(x_1 - 1)$ or,
equivalently, the line with equation $x_2 = -2x_1 + 4$; see Figure 2.1 for an illustration.
An alternative representation of the distribution is as

$$X \overset{\mathcal{D}}{=} \xi + BW,$$

where $W \sim \mathcal{N}(0, 1)$ and $B = (1, -2)^\top$, since then also $\Sigma_{-1} = BB^\top$, so the
representation in this form is far from unique.
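
Both representations in Example 2.8 can be checked with a few lines of code. The sketch below assumes the symmetric square root $A = \Sigma_{-1}/\sqrt{5}$ obtained from the spectral decomposition and the vector $B = (1, -2)^\top$; the helper names and the seed are illustrative.

```python
import math
import random

SIGMA = [[1.0, -2.0], [-2.0, 4.0]]     # covariance for rho = -1
XI = [1.0, 2.0]                        # mean vector

s = 1.0 / math.sqrt(5.0)
A = [[s, -2.0 * s], [-2.0 * s, 4.0 * s]]   # symmetric square root of SIGMA
B = [[1.0], [-2.0]]                        # 2 x 1 alternative with B B^T = SIGMA


def gram(m):
    """Return m m^T for a matrix given as a list of rows."""
    return [[sum(m[i][k] * m[j][k] for k in range(len(m[0])))
             for j in range(len(m))] for i in range(len(m))]


def sample(matrix, rng):
    """Draw XI + matrix @ Z with Z a vector of independent N(0, 1) variables."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(matrix[0]))]
    return [XI[i] + sum(matrix[i][k] * z[k] for k in range(len(z)))
            for i in range(len(matrix))]


rng = random.Random(0)
print(gram(A), gram(B))
for matrix in (A, B):
    x = sample(matrix, rng)
    print(x, "distance from line:", x[1] - (-2.0 * x[0] + 4.0))
```

Every draw from either representation lies on the line $x_2 = -2x_1 + 4$ up to rounding error, which is the support of the singular distribution.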

Finally we recall the important $\chi^2(k)$-distribution:

Definition 2.9. The $\chi^2(k)$-distribution, where $k$ is the number of degrees of freedom, is the
distribution of

$$Z_1^2 + \cdots + Z_k^2,$$

where $Z_1, \ldots, Z_k$ are independent and identically distributed as $\mathcal{N}(0, 1)$.

The $\chi^2(k)$-distribution is also a Gamma distribution $\Gamma(k/2, 2)$ with shape
parameter $k/2$ and scale parameter 2, i.e. it has density

$$f(x) = \frac{x^{k/2-1} e^{-x/2}}{2^{k/2}\, \Gamma(k/2)}, \quad x > 0,$$

with respect to Lebesgue measure on $\mathbb{R}_+$; its characteristic function is

$$\varphi_{\chi^2(k)}(t) = \mathbf{E}(e^{it(Z_1^2 + \cdots + Z_k^2)}) = \mathbf{E}(e^{itZ_1^2})^k = \frac{1}{(1 - 2it)^{k/2}}$$

and it has mean $k$ and variance $2k$. See, for example, Jacod and Protter (2004,
Example 15.6) for some of these facts.
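
Definition 2.9 suggests a direct way to simulate from the $\chi^2(k)$-distribution and to check its mean $k$ and variance $2k$ empirically. The sketch below is an illustration; the function names, number of replications and seed are assumptions.

```python
import random


def chi_square_draw(k, rng):
    """One draw from chi^2(k) as a sum of k squared independent N(0, 1) variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


def sample_moments(k, reps, seed):
    """Empirical mean and variance of `reps` simulated chi^2(k) variables."""
    rng = random.Random(seed)
    draws = [chi_square_draw(k, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    variance = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    return mean, variance


print(sample_moments(k=4, reps=50_000, seed=11))
```

For $k = 4$ the empirical mean and variance should be close to 4 and 8, respectively, up to Monte Carlo error.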


2.2 The normal distribution on a vector space 


2.2.1 Random vectors in V 


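Let $(V, \langle\cdot,\cdot\rangle)$ be a Euclidean space and let $\mathcal{B}(V)$ be the Borel sets of $V$, i.e. the
$\sigma$-algebra generated by open sets. As for $\mathbb{R}^d$, a probability distribution on $(V, \mathcal{B}(V))$
is uniquely determined by its characteristic function

$$\varphi_X(u) = \mathbf{E}(e^{i\langle u, X\rangle}).$$

In particular, the normal distribution with mean $\xi \in V$ and covariance $\Sigma$ is the
distribution with characteristic function

$$\varphi_X(u) = \exp\{i\langle u, \xi\rangle - \langle u, \Sigma u\rangle/2\}.$$

Here $\Sigma \in L(V, V)$ is a positive semidefinite and self-adjoint linear map so that for
all $u, v \in V$ we have

$$\langle u, \Sigma u\rangle \ge 0 \quad \text{and} \quad \langle u, \Sigma v\rangle = \langle \Sigma u, v\rangle.$$

In other words, $X$ has a normal distribution on $V$ if and only if any linear form
$\langle u, X\rangle$ follows a normal distribution on $\mathbb{R}$ as $Y_u = \langle u, X\rangle \sim \mathcal{N}(\langle u, \xi\rangle, \langle u, \Sigma u\rangle)$.
This follows since the characteristic function of $Y_u$ is

$$\begin{aligned}
\varphi_{Y_u}(t) &= \mathbf{E}(e^{itY_u}) = \mathbf{E}(e^{it\langle u, X\rangle}) = \mathbf{E}(e^{i\langle tu, X\rangle}) \\
  &= \exp\{i\langle tu, \xi\rangle - \langle tu, \Sigma tu\rangle/2\} = \exp\{it\langle u, \xi\rangle - t^2\langle u, \Sigma u\rangle/2\}.
\end{aligned}$$

Then the mean $\xi \in V$ is determined by the equation

$$\mathbf{E}(\langle u, X\rangle) = \langle u, \xi\rangle \quad \text{for all } u \in V.$$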

The map $\Sigma$ is the covariance operator of the distribution or, briefly, the
covariance. This reflects the fact that for all linear forms determined by $u, v \in V$ we have

$$\mathbf{V}(\langle u, X\rangle, \langle v, X\rangle) = \langle u, \Sigma v\rangle, \tag{2.3}$$

which follows by using Proposition A.2 and the relation

$$\mathbf{V}(\langle u, X\rangle) = \langle u, \Sigma u\rangle.$$

For a given orthonormal basis $e_1, \ldots, e_d$, we define the mean vector and covariance matrix
of a random variable on $V$ as the mean vector and covariance matrix of its coordinates
with respect to this basis, i.e. for the normal distribution above the vector and matrix
with entries

$$\xi_i = \mathbf{E}(\langle e_i, X\rangle), \qquad \sigma_{ij} = \mathbf{V}(\langle e_i, X\rangle, \langle e_j, X\rangle) = \langle e_i, \Sigma e_j\rangle$$

and we shall use $\Sigma$ to denote both the covariance operator and its matrix with respect
to a given basis. We have the following analogue of Proposition 2.4.


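Proposition 2.10. Let $(V, \langle\cdot,\cdot\rangle_V)$ and $(W, \langle\cdot,\cdot\rangle_W)$ be Euclidean spaces and assume
$X \sim \mathcal{N}_V(\xi, \Sigma)$. For $a \in W$ and $B \in L(V, W)$ it holds that

$$Y = a + BX \sim \mathcal{N}_W(a + B\xi, B\Sigma B^*).$$

Proof. The characteristic function of $Y$ is

$$\begin{aligned}
\varphi_Y(u) &= \mathbf{E}(e^{i\langle u, Y\rangle_W}) = e^{i\langle u, a\rangle_W}\, \mathbf{E}(e^{i\langle u, BX\rangle_W}) = e^{i\langle u, a\rangle_W}\, \mathbf{E}(e^{i\langle B^*u, X\rangle_V}) \\
  &= e^{i\langle u, a\rangle_W}\, \varphi_X(B^*u) = e^{i\langle u, a\rangle_W} \exp\{i\langle B^*u, \xi\rangle_V - \langle B^*u, \Sigma B^*u\rangle_V/2\} \\
  &= e^{i\langle u, a\rangle_W} \exp\{i\langle u, B\xi\rangle_W - \langle u, B\Sigma B^*u\rangle_W/2\} \\
  &= \exp\{i\langle u, \mu\rangle_W - \langle u, \Psi u\rangle_W/2\},
\end{aligned}$$

where $\mu = a + B\xi$ and $\Psi = B\Sigma B^*$; the result follows.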


Further we have a structure theorem, similar to Theorem 2.7: 


Theorem 2.11. Let $X$ be a random vector with values in the Euclidean space
$(V, \langle\cdot,\cdot\rangle)$, let $\xi \in V$, and let $\Sigma \in L(V,V)$ be a positive semidefinite and self-adjoint linear
map. Then $X \sim N_V(\xi, \Sigma)$ if and only if there is a linear map $A \in L(V,V)$ with
$AA^* = \Sigma$ such that $X$ has the same distribution as $Y = \xi + AZ$, where $Z \sim N_V(0, I)$
and $I$ is the identity map on $V$.

Proof. Let $e_1, \dots, e_d$ be an orthonormal basis for $V$ and $Y = (Y_1, \dots, Y_d)^\top$ the
coordinates of $X$ with respect to this basis. Then $Y \sim N_d(\mu, \Gamma)$ where $\mu$ are the
coordinates of $\xi$ and $\Gamma_{ij} = \langle e_i, \Sigma e_j\rangle$ the representation of $\Sigma$ in this basis. Theorem 2.7
yields that

$$Y \overset{D}{=} \mu + BW,$$

where $\Gamma = BB^\top$ and $W \sim N_d(0, I_d)$. Now let $Z$ be the random variable in $V$ with
coordinates $W$ and $A$ the linear map represented by $B$ in the chosen basis, i.e.

$$Z = \sum_{i=1}^d W_i e_i, \qquad Au = \sum_{i=1}^d \sum_{j=1}^d B_{ij} \langle u, e_j\rangle e_i.$$

Then $Z \sim N_V(0, I)$, $\Sigma = AA^*$ and $X \overset{D}{=} \xi + AZ$, as required.
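The structure theorem is also a recipe for simulation: factor the covariance as $\Sigma = AA^\top$ and transform standard normal variables. The sketch below is one possible way of doing this, assuming coordinates in an orthonormal basis and using the spectral decomposition (which, unlike a Cholesky factorization, also works when $\Sigma$ is singular). The names `factor_covariance` and `sample_normal` and the numerical values are hypothetical.

```python
import numpy as np

def factor_covariance(Sigma):
    """Return a matrix A with A @ A.T equal to Sigma (Sigma symmetric, PSD)."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    return eigvecs * np.sqrt(eigvals)       # scales column j by sqrt(lambda_j)

def sample_normal(xi, Sigma, n, rng):
    """Draw n realisations of X = xi + A Z with Z standard normal."""
    A = factor_covariance(Sigma)
    Z = rng.standard_normal((n, len(xi)))
    return xi + Z @ A.T

# Illustrative singular covariance (an assumption for this sketch).
Sigma = np.array([[4.0, 2.0],
                  [2.0, 1.0]])
xi = np.array([0.0, 1.0])
A = factor_covariance(Sigma)
X = sample_normal(xi, Sigma, 5, np.random.default_rng(0))
```

Every simulated point lies, up to round-off, on the line $\xi + \mathrm{rg}(A)$, which here is spanned by $(2, 1)^\top$.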


The distribution of the random variable $Y$ in Theorem 2.11 is supported on the
affine subspace $\xi + \mathrm{rg}(A)$ with $\mathrm{rg}(A)$ being the image of the map $A$. Although the
covariance $\Sigma$ may be represented as $\Sigma = AA^*$ in a multitude of ways, see Example 2.8,
the space $\mathrm{rg}(A)$ does not depend on this representation since Lemma A.6 yields that
$\mathrm{rg}(\Sigma) = \mathrm{rg}(A)$. We thus have:

Corollary 2.12. The normal distribution $N_V(\xi, \Sigma)$ is supported on the affine
subspace $\xi + \mathrm{rg}(\Sigma)$ of $V$.

We next define a concentration operator or just concentration of the normal
distribution $N_V(\xi, \Sigma)$ as any positive self-adjoint linear map $K \in L(V,V)$ satisfying
for all $u \in \mathrm{rg}(\Sigma)$ and all $v \in V$ that

$$\langle Ku, \Sigma v\rangle = \langle u, v\rangle. \tag{2.4}$$


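Proposition 2.13. A positive self-adjoint map $K$ is a concentration operator for $\Sigma$
if and only if it satisfies the relation

$$\Sigma K \Sigma = \Sigma. \tag{2.5}$$

Proof. If $K$ is self-adjoint and satisfies (2.4) we have for all $u, v \in V$

$$\langle \Sigma K \Sigma u, v\rangle = \langle K\Sigma u, \Sigma v\rangle = \langle \Sigma u, v\rangle$$

implying that $\Sigma K \Sigma = \Sigma$.
Conversely, if $\Sigma K \Sigma = \Sigma$ we have for all $u = \Sigma w$ and $v \in V$ that

$$\langle Ku, \Sigma v\rangle = \langle K\Sigma w, \Sigma v\rangle = \langle \Sigma K\Sigma w, v\rangle = \langle \Sigma w, v\rangle = \langle u, v\rangle$$

which shows (2.4).

A positive and self-adjoint map $K$ that satisfies (2.5) is also known as a generalized
inverse of $\Sigma$, sometimes written as $K = \Sigma^{-}$.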


Example 2.14. If $\Sigma$ is singular, a generalized inverse is far from unique. For
example, if we have

$$\Sigma = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix},$$

then any positive definite matrix of the form

$$K_{ab} = \begin{pmatrix} 1/3 & a \\ a & b \end{pmatrix} \tag{2.6}$$


is a concentration matrix for $\Sigma$.

Proposition 2.15. If $\Sigma$ is invertible, its inverse $K = \Sigma^{-1}$ is the unique concentration
operator for $\Sigma$.

Proof. If $\Sigma$ is invertible we may let $K = \Sigma^{-1}$ which then clearly satisfies $\Sigma K \Sigma = \Sigma$.
And conversely, if we pre- and post-multiply the relation $\Sigma K \Sigma = \Sigma$ with $\Sigma^{-1}$
we conclude $K = \Sigma^{-1}$.

A concentration operator defines an inner product $\langle\cdot,\cdot\rangle_K$ on $V$ by the relation

$$\langle u, v\rangle_K = \langle Ku, v\rangle.$$

Even if $\Sigma$ is singular and the concentration therefore is not unique, the restriction of
the associated inner product onto the range of $\Sigma$ is unique.

Proposition 2.16. Let $K_1$ and $K_2$ be concentration operators for $\Sigma$. It then holds
for all $u, v \in \mathrm{rg}(\Sigma)$ that

$$\langle u, v\rangle_{K_1} = \langle K_1 u, v\rangle = \langle K_2 u, v\rangle = \langle u, v\rangle_{K_2}.$$

Proof. By Proposition A.2, it is sufficient to show that the inner products agree on
the diagonal. So let $u = \Sigma v \in \mathrm{rg}(\Sigma)$. Then for $i = 1, 2$ we have

$$\langle K_i u, u\rangle = \langle K_i \Sigma v, \Sigma v\rangle = \langle \Sigma K_i \Sigma v, v\rangle = \langle \Sigma v, v\rangle,$$

showing that the inner products coincide.


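Example 2.17. As a simple illustration of Proposition 2.13, we may consider the
singular covariance

$$\Sigma = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix}$$

as in Example 2.14 above. Here we have

$$\mathrm{rg}(\Sigma) = \{(x, 0)^\top \mid x \in \mathbb{R}\}.$$

Then any matrix of the form (2.6) will satisfy for $u = (x, 0)^\top$ and $v = (y, 0)^\top$ in $\mathrm{rg}(\Sigma)$

$$\langle u, v\rangle_K = \begin{pmatrix} x & 0 \end{pmatrix} K_{ab} \begin{pmatrix} y \\ 0 \end{pmatrix} = \frac{xy}{3},$$

reflecting that this does not depend on a nor b.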
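Relation (2.5) is easy to check numerically. The sketch below assumes a small singular covariance chosen purely for illustration; the helper `is_concentration` is hypothetical and interprets "positive" as positive semidefinite. It tests whether candidate matrices are concentrations and compares the induced inner products on the range of the covariance.

```python
import numpy as np

def is_concentration(K, Sigma, tol=1e-10):
    """Check that K is symmetric, positive semidefinite and Sigma K Sigma = Sigma."""
    symmetric = np.allclose(K, K.T, atol=tol)
    psd = np.all(np.linalg.eigvalsh((K + K.T) / 2) >= -tol)
    relation = np.allclose(Sigma @ K @ Sigma, Sigma, atol=tol)
    return bool(symmetric and psd and relation)

Sigma = np.diag([2.0, 0.0])                 # assumed singular covariance
K1 = np.linalg.pinv(Sigma)                  # Moore-Penrose inverse, diag(1/2, 0)
K2 = np.array([[0.5, 1.0],
               [1.0, 3.0]])                 # positive definite, same (1,1) entry
K3 = np.eye(2)                              # violates Sigma K Sigma = Sigma

# Two vectors in rg(Sigma): the K1 and K2 inner products coincide there.
u = np.array([1.0, 0.0])
v = np.array([3.0, 0.0])
ip1, ip2 = u @ K1 @ v, u @ K2 @ v
```

The Moore-Penrose inverse is only one convenient choice of generalized inverse; any symmetric positive semidefinite matrix with the correct action on the range serves equally well.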


In the special case where $\Sigma$ is invertible, the corresponding normal distribution
has density with respect to Lebesgue measure on $V$:

Proposition 2.18. Let $\xi \in V$ and $\Sigma \in L(V,V)$ be positive, invertible and self-adjoint.
Let now $K = \Sigma^{-1}$. Then the normal distribution $N_V(\xi, \Sigma)$ has density

$$f_{(\xi,\Sigma)}(x) = \frac{(\det K)^{1/2}}{(2\pi)^{\dim V/2}} \exp\{-\|x - \xi\|_K^2/2\} \tag{2.7}$$

with respect to standard Lebesgue measure on $V$. Here $\det K$ is the determinant of
the concentration matrix with entries $k_{ij} = \langle Ke_i, e_j\rangle$ where $e_1, \dots, e_{\dim V}$ is any
basis for $V$ that is orthonormal with respect to the inner product $\langle\cdot,\cdot\rangle$.

Proof. This follows from Proposition 2.2 since the standard Lebesgue measure on
$V$ is obtained by transforming the standard Lebesgue measure on $\mathbb{R}^d$ onto $V$ via the
coordinates with respect to any orthonormal basis; see Appendix A.1.
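For numerical work, (2.7) is most conveniently evaluated on the log scale. The following is a minimal sketch assuming $V = \mathbb{R}^d$ with the standard inner product; the function name `normal_logpdf` is hypothetical.

```python
import numpy as np

def normal_logpdf(x, xi, K):
    """Log of the density (2.7) of N(xi, K^{-1}) at x, in terms of the concentration K."""
    x, xi, K = (np.asarray(m, dtype=float) for m in (x, xi, K))
    d = x.shape[0]
    r = x - xi
    sign, logdet = np.linalg.slogdet(K)     # numerically stable log-determinant
    if sign <= 0:
        raise ValueError("K must be positive definite")
    # 0.5*log det K - (d/2)*log(2*pi) - ||x - xi||_K^2 / 2
    return 0.5 * logdet - 0.5 * d * np.log(2 * np.pi) - 0.5 * (r @ K @ r)

# Illustrative call with a diagonal concentration (values are assumptions).
value = normal_logpdf([1.0, 0.5], [0.0, 0.0], np.diag([1.0, 4.0]))
```

Working with $K$ directly avoids inverting $\Sigma$ inside the function; for a diagonal $K$ the result reduces to a sum of univariate normal log-densities.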


If $\Sigma$ is only positive semidefinite, we say that the normal distribution is singular and it
does not have a density with respect to Lebesgue measure on $V$; rather $Y = X - \xi$ has
density with respect to Lebesgue measure on the subspace $L = \mathrm{rg}(\Sigma)$ which itself is
a Euclidean space. For the sake of completeness, we mention that this density is very
similar to that above:

$$f_{\Sigma}(y) = \frac{(\operatorname{det}_L K)^{1/2}}{(2\pi)^{\dim L/2}} \exp\{-\|y\|_K^2/2\}$$

where now $\operatorname{det}_L K$ is the determinant of the restriction of the concentration matrix $K$
onto $L$ with respect to an orthonormal basis $e_1, \dots, e_{\dim L}$ for $L$, or, in other words,

$$\operatorname{det}_L K = \prod_{i=1}^{\dim L} \lambda_i^{-1}$$

where $\lambda_1 \ge \dots \ge \lambda_d$ are the ordered eigenvalues of $\Sigma$. We refrain from giving a
formal proof of this, as this density plays no role in subsequent developments.


2.2.2 Projections with respect to the concentration 


Let now $L \subseteq V$ be a linear subspace of $V$ and assume that $\Sigma$ is regular so the inner
product $\langle\cdot,\cdot\rangle_K$ is well-defined on $V$. We shall refer to $\langle\cdot,\cdot\rangle_K$ as the concentration
inner product whereas we refer to $\langle\cdot,\cdot\rangle$ as the base inner product. The (orthogonal)
projection $\Pi_L$ onto $L$ of an element $u \in V$ with respect to the concentration inner
product is then determined as the unique point $\Pi_L u \in L$ satisfying for all $w \in L$

$$\langle u - \Pi_L u, w\rangle_K = 0 \tag{2.8}$$

or, equivalently,

$$\langle u - \Pi_L u, Kw\rangle = 0.$$

Note that for simplicity we here and elsewhere omit the qualifier ‘orthogonal’ and
simply say ‘projection’ to mean ‘orthogonal projection’ when there is no ambiguity.
We have

Theorem 2.19. A linear map $P \in L(V,V)$ is a projection with respect to $\langle\cdot,\cdot\rangle_K$ if
and only if $P$ satisfies $P^2 = P$ and $P^*K = KP$, where $P^*$ is the adjoint of $P$ with
respect to the base inner product.

Proof. Theorem A.3 gives that we just have to check whether $P$ is self-adjoint with
respect to $\langle\cdot,\cdot\rangle_K$. But we have for all $u, v \in V$ that

$$\langle u, Pv\rangle_K = \langle Ku, Pv\rangle = \langle P^*Ku, v\rangle \quad\text{and}\quad \langle Pu, v\rangle_K = \langle KPu, v\rangle$$

whence the result follows.


Remark 2.20. The condition for self-adjointness with respect to $\langle\cdot,\cdot\rangle_K$ may be
expressed in terms of the covariance $\Sigma$ as

$$\Sigma P^* = P\Sigma$$

since we may pre- and post-multiply the relation $P^*K = KP$ with $\Sigma$.


Common ways of specifying a linear space are as the image of a linear
map, typically from $\mathbb{R}^k$, i.e. as $L = \mathrm{rg}(A) = A(\mathbb{R}^k)$, where $A \in L(\mathbb{R}^k, V)$, or as
the kernel of a linear map, i.e. as $L = \ker(B) = \{v \in V \mid B(v) = 0\}$ for $B \in L(V, \mathbb{R}^m)$.
It is then useful to express the projections with respect to the concentration inner
product in terms of $K$, $\Sigma$, and the maps $A$ and $B$. Indeed we have the following,
where we have replaced $\mathbb{R}^k$ and $\mathbb{R}^m$ with general Euclidean spaces $U$ and $W$.
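Proposition 2.21. Let $(U, \langle\cdot,\cdot\rangle_U)$, $(V, \langle\cdot,\cdot\rangle_V)$, and $(W, \langle\cdot,\cdot\rangle_W)$ be Euclidean spaces.
Suppose $\Sigma \in L(V,V)$ is a positive definite and self-adjoint covariance, $L = \mathrm{rg}(A)$
or $L = \ker(B)$, where $A \in L(U,V)$ is injective and $B \in L(V,W)$ surjective. Then
$A^*KA$ and $B\Sigma B^*$ are both invertible and the projection $\Pi_L \in L(V,V)$ onto $L$
with respect to the concentration inner product $\langle\cdot,\cdot\rangle_K$ is given as

$$\Pi_L = A(A^*KA)^{-1}A^*K \quad\text{or, respectively,}\quad \Pi_L = I - \Sigma B^*(B\Sigma B^*)^{-1}B,$$

where $I$ is the identity on $V$ and $K = \Sigma^{-1}$.

Proof. Let $Q = A^*KA : U \to U$ and assume $u \in \ker(Q)$. Then we have

$$0 = \langle u, Qu\rangle_U = \langle u, A^*KAu\rangle_U = \langle Au, KAu\rangle_V = \|Au\|_K^2,$$

whence we conclude that $Au = 0$ and therefore $u = 0$ since $A$ is assumed injective;
hence we conclude that $Q = A^*KA$ is injective and thus invertible. The similar result for
$B\Sigma B^*$ follows by replacing $A$ with $B^*$, which is injective by Proposition A.4, and $K$ with $\Sigma$. Next, let
$H = A(A^*KA)^{-1}A^*K$. Then

$$H^2 = A(A^*KA)^{-1}A^*KA(A^*KA)^{-1}A^*K = A(A^*KA)^{-1}A^*K = H$$

so $H$ is idempotent. Further, it satisfies $H^*K = KH$ since

$$H^*K = (A(A^*KA)^{-1}A^*K)^*K = KA(A^*KA)^{-1}A^*K = KH.$$

Finally we show that $\mathrm{rg}(H) = \mathrm{rg}(A)$. The expression for $H$ yields directly that
$\mathrm{rg}(H) \subseteq \mathrm{rg}(A)$. For the reverse inclusion, we let $Au$ denote a generic element of
$\mathrm{rg}(A)$ and get

$$HAu = A(A^*KA)^{-1}A^*KAu = Au$$

and hence $Au \in \mathrm{rg}(H)$, showing that $\mathrm{rg}(A) \subseteq \mathrm{rg}(H)$, as required.

The statement in terms of $B$ follows by replacing $A$ with $\Sigma B^*$ in the expression
for $H$ above. Indeed, $\Sigma B^*$ is injective since $B^*$ is injective and $\Sigma$ is invertible, and
$(\Sigma B^*)^*K\Sigma B^* = B\Sigma K\Sigma B^* = B\Sigma B^*$, implying that
$\Sigma B^*(B\Sigma B^*)^{-1}B\Sigma K = \Sigma B^*(B\Sigma B^*)^{-1}B$ is the projection onto $\mathrm{rg}(\Sigma B^*)$.
For $v \in V$ and $w \in W$ we have $\langle \Sigma B^*w, v\rangle_K = \langle B^*w, v\rangle_V = \langle w, Bv\rangle_W$, so $v$ is
orthogonal to $\mathrm{rg}(\Sigma B^*)$ with respect to $\langle\cdot,\cdot\rangle_K$ if and only if $Bv = 0$. Hence $\mathrm{rg}(\Sigma B^*)$
is the orthogonal complement of $\ker(B) = L$ with respect to the concentration inner product and thus
$I - \Sigma B^*(B\Sigma B^*)^{-1}B$ is the projection onto $L$.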
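The conditions of Theorem 2.19 and the defining relation (2.8) can be verified numerically for the projection onto the image of an injective map. The sketch below assumes $V = \mathbb{R}^3$ with the standard base inner product, so adjoints are transposes; the function name `projection_onto_range` and all matrices are hypothetical illustration values.

```python
import numpy as np

def projection_onto_range(A, K):
    """Matrix of the projection onto rg(A) w.r.t. <u, v>_K = u^T K v (A injective)."""
    AtK = A.T @ K
    # Solves (A^T K A) M = A^T K, i.e. M = (A^T K A)^{-1} A^T K, without an explicit inverse.
    return A @ np.linalg.solve(AtK @ A, AtK)

K = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])     # assumed positive definite concentration
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])          # assumed injective map from R^2 to R^3

H = projection_onto_range(A, K)
u = np.array([1.0, -2.0, 0.5])
residual = u - H @ u                # should be K-orthogonal to every column of A
```

Note that `H` is in general not symmetric: it is self-adjoint with respect to the concentration inner product, not the base inner product.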


2.2.3 Derived distributions 


Here we briefly describe some important distributional results associated with the 
normal distribution. We first need a lemma. 


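Lemma 2.22. Assume that $X \sim N_V(\xi, \Sigma)$ on $(V, \langle\cdot,\cdot\rangle)$ and let $A \in L(V, W_1)$ and
$B \in L(V, W_2)$ where $(W_1, \langle\cdot,\cdot\rangle_1)$ and $(W_2, \langle\cdot,\cdot\rangle_2)$ are Euclidean spaces. Then
$Y = AX$ and $Z = BX$ are independent if and only if $A\Sigma B^* = 0$.

Proof. By Proposition 2.10 the distribution of $(Y, Z)$ is normal on $W_1 \oplus W_2$ with
mean $(A\xi, B\xi)$ and covariance $\Sigma_{AB}$ given as

$$\langle (w_1, w_2), \Sigma_{AB}(w_1, w_2)\rangle_{12} = \langle w_1, A\Sigma A^*w_1\rangle_1 + 2\langle w_1, A\Sigma B^*w_2\rangle_1 + \langle w_2, B\Sigma B^*w_2\rangle_2.$$

We thus get for the characteristic function

$$\begin{aligned}
\varphi_{Y,Z}(w_1, w_2) &= E\big(e^{i(\langle w_1, Y\rangle_1 + \langle w_2, Z\rangle_2)}\big) \\
&= e^{i(\langle w_1, A\xi\rangle_1 + \langle w_2, B\xi\rangle_2)}\, e^{-\langle w_1, A\Sigma A^*w_1\rangle_1/2 - \langle w_1, A\Sigma B^*w_2\rangle_1 - \langle w_2, B\Sigma B^*w_2\rangle_2/2} \\
&= \varphi_Y(w_1)\varphi_Z(w_2)\, e^{-\langle w_1, A\Sigma B^*w_2\rangle_1}.
\end{aligned}$$

Thus we have that $\varphi_{Y,Z}(w_1, w_2) = \varphi_Y(w_1)\varphi_Z(w_2)$ if and only if it holds that
$A\Sigma B^* = 0$, and the result follows.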


Note in particular the following corollary which is a minor modification of Proposi- 
tion 2.6. 


Corollary 2.23. Let $(V, \langle\cdot,\cdot\rangle)$ be a Euclidean space and $X \sim N_V(\xi, \Sigma)$. Then
disjoint sets $Y_A = (Y_1, \dots, Y_m)^\top$ and $Y_B = (Y_{m+1}, \dots, Y_{m+k})^\top$ of coordinates
$Y_i = \langle e_i, X\rangle$ for $X$ with respect to an orthonormal basis $e = (e_1, \dots, e_d)$ for $V$ are
independent if and only if

$$\sigma_{ij} = 0 \quad\text{for all } i = 1, \dots, m,\ j = m+1, \dots, m+k,$$

where $\sigma_{ij} = \langle e_i, \Sigma e_j\rangle$ are the entries of the covariance matrix for $X$ in this basis.

Proof. Let $A$ be the linear map that sends $X$ into the coordinates $Y_A$ and $B$ the map
that sends $X$ to the coordinates $Y_B$. Then, in the basis considered, the matrix for
$A\Sigma B^*$ is given as

$$(A\Sigma B^*)_{ij} = \langle e_i, A\Sigma B^*e_j\rangle = \langle e_i, \Sigma e_j\rangle = \sigma_{ij}$$

and the result follows from Lemma 2.22.


We are now ready to show the following important result, known as the decom- 
position theorem for the normal distribution. 


Theorem 2.24. Let $X \sim N_V(\xi, \Sigma)$ on $(V, \langle\cdot,\cdot\rangle)$ where $\dim V = d$ and assume that
$\Sigma$ is invertible with concentration $K = \Sigma^{-1}$. Further, let $L \subseteq V$ be a subspace
of dimension $\dim L = m$ and let $\Pi_L$ denote the projection onto $L$ with respect to
the concentration inner product $\langle\cdot,\cdot\rangle_K$. Let $I$ denote the identity on $V$. Then the
following hold:

(a) The projections $\Pi_L X$ and $X - \Pi_L X$ are independent and normally distributed;
(b) The covariances of $\Pi_L X$ and $X - \Pi_L X$ are $\Pi_L\Sigma$ and $(I - \Pi_L)\Sigma$;
(c) $K$ is a concentration operator for both of $\Pi_L X$ and $X - \Pi_L X$;
(d) If $\xi \in L$, then $\|X - \Pi_L X\|_K^2 \sim \chi^2(k)$, where $k = \dim V - \dim L = d - m$.

Proof. From Remark 2.20 we have that $\Sigma\Pi_L^* = \Pi_L\Sigma$ implying that

$$(I - \Pi_L)\Sigma\Pi_L^* = (I - \Pi_L)\Pi_L\Sigma = (\Pi_L - \Pi_L^2)\Sigma = (\Pi_L - \Pi_L)\Sigma = 0$$

where we have used that $\Pi_L$ is idempotent. Lemma 2.22 now yields that $\Pi_L X$ and
$X - \Pi_L X = (I - \Pi_L)X$ are independent and since $\Pi_L$ and $I - \Pi_L$ are linear, they
are also normally distributed. This establishes (a).

The covariance for $\Pi_L X$ is

$$V(\Pi_L X) = \Pi_L\Sigma\Pi_L^* = \Pi_L^2\Sigma = \Pi_L\Sigma$$

and by analogy we get $V(X - \Pi_L X) = (I - \Pi_L)\Sigma$ which yields (b).
But then $K$ satisfies for $u \in L = \mathrm{rg}(\Pi_L)$

$$\langle Ku, \Pi_L\Sigma v\rangle = \langle \Pi_L^*Ku, \Sigma v\rangle = \langle K\Pi_L u, \Sigma v\rangle = \langle Ku, \Sigma v\rangle = \langle u, v\rangle$$

showing that $K$ is a concentration for $\Pi_L\Sigma$. The calculations for $X - \Pi_L X$ are again
analogous; hence (c) is established.

If $\xi \in L$ we have that the mean of $X - \Pi_L X$ is $\xi - \Pi_L\xi = 0$ and hence it holds
that the residual $X - \Pi_L X$ is distributed as $N_V(0, (I - \Pi_L)\Sigma)$. Now choose a basis
$e_1, \dots, e_d$ for $V$ that is orthonormal for the concentration inner product and where
$e_1, \dots, e_m$ form a basis for $L$ and thus $e_{m+1}, \dots, e_d$ a basis for $L^\perp$. Then

$$\|X - \Pi_L X\|_K^2 = \sum_{i=1}^{k} Z_{m+i}^2 = \sum_{i=m+1}^{d} Z_i^2$$

where $Z_i = \langle X, e_i\rangle_K$ are the coordinates in this basis. But we have for the covariances
of $Z_i$

$$V(Z_i, Z_j) = V(\langle Ke_i, X\rangle, \langle Ke_j, X\rangle) = \langle Ke_i, \Sigma Ke_j\rangle = \langle Ke_i, e_j\rangle = \langle e_i, e_j\rangle_K.$$

Since $\langle e_i, e_j\rangle_K = 0$ for $i \ne j$, $\langle e_i, e_i\rangle_K = 1$, and $E(Z_i) = \langle \xi, e_i\rangle_K = 0$ for $i > m$
as $\xi \in L$, we get that $Z_{m+1}, \dots, Z_d$
are independent and standard normally distributed whence Definition 2.9 yields that
$\|X - \Pi_L X\|_K^2 \sim \chi^2(k)$, where $k = d - m = \dim V - \dim L$, as required for (d).
This completes the proof.


Remark 2.25. Note in particular the conclusion in Theorem 2.24(c), saying that
although the covariance for the projection $\Pi_L X$ is $\Pi_L\Sigma$ and thus differs from $\Sigma$,
the concentration $K$ is still a valid concentration; here one should bear in mind that
only the restriction of $K$ to $L$ matters, cf. Proposition 2.16.


We illustrate the use of the decomposition theorem in a very simple example that 
is both classical and in some sense generic. 
Example 2.26. Consider $V = \mathbb{R}^n$ with standard inner product $\langle u, v\rangle = u^\top v$ and the
normal distribution $N_n(\xi, \sigma^2 I_n)$. Assume $\xi \in L$ where

$$L = \{\xi \in \mathbb{R}^n \mid \xi_1 = \xi_2 = \dots = \xi_n = \mu,\ \mu \in \mathbb{R}\}$$

is the one-dimensional subspace of $\mathbb{R}^n$ where all coordinates are identical, i.e. the
space spanned by the constant vector $c = (1, \dots, 1)^\top$. From (A.2), the orthogonal
projection $\Pi_L X$ of $X$ onto $L$ is determined as

$$\Pi_L X = \frac{\langle X, c\rangle}{\|c\|^2}\, c = \frac{X^\top c}{c^\top c}\, c = \frac{\sum_{i=1}^n X_i}{n}\, c = \bar{X} c = (\bar{X}, \dots, \bar{X})^\top.$$

Similarly, the residual $X - \Pi_L X$ is

$$X - \Pi_L X = (X_1 - \bar{X}, \dots, X_n - \bar{X})^\top$$

and Theorem 2.24(a) now yields that $\bar{X}$ and the residual $(X_1 - \bar{X}, \dots, X_n - \bar{X})^\top$
are independent.
The matrix for $\Pi_L$ is $E/n$ where $E$ is the matrix with all elements equal to 1:

$$E = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix}.$$

Hence from Theorem 2.24(b) we deduce that the covariance matrices for $\Pi_L X$ and
$X - \Pi_L X$ are

$$V(\Pi_L X) = \frac{\sigma^2}{n} E, \qquad V(X - \Pi_L X) = \sigma^2\left(I_n - \frac{1}{n}E\right).$$

Figure 2.2: Visualization of Theorem 2.27. If $X \sim N_V(\xi, \Sigma)$ with $\xi \in L' \subseteq L \subseteq V$,
$\Pi_1 X = X - \Pi X$, $\Pi_2 X = \Pi X - \Pi' X$, and $\Pi_3 X = \Pi' X$ are all independent if
projections are orthogonal with respect to $\langle\cdot,\cdot\rangle_K$. Further, since $\xi \in L'$, the squared
norms $\|X - \Pi X\|_K^2$ and $\|\Pi X - \Pi' X\|_K^2$ are $\chi^2$-distributed with degrees of freedom
equal to $\dim V - \dim L$ and $\dim L - \dim L'$ respectively.

Note that both of these covariance matrices are singular, reflecting that $\Pi_L X$ is
supported on the one-dimensional subspace $L$, and $X - \Pi_L X$ is supported on the
orthogonal complement $L^\perp$. Still $K = \sigma^{-2} I_n$ is a concentration matrix for both of
these distributions, as shown in Theorem 2.24(c). Using this fact, we get from
Theorem 2.24(d) that

$$\|X - \Pi_L X\|_K^2 = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \bar{X})^2$$

follows a $\chi^2(n - 1)$ distribution since $\dim V = n$ and $\dim L = 1$.
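The matrix statements in this example are easily checked numerically. The sketch below assumes illustrative values of $n$ and $\sigma^2$ (not taken from the text); it builds the projection matrix $E/n$, the two singular covariances, and verifies that the cross-covariance vanishes and that $K = \sigma^{-2} I_n$ satisfies relation (2.5) for both.

```python
import numpy as np

n, sigma2 = 4, 2.0                       # illustrative values (assumptions)
I = np.eye(n)
Sigma = sigma2 * I                       # covariance of X
K = I / sigma2                           # concentration of X
P = np.ones((n, n)) / n                  # matrix E/n of the projection onto L

cov_fit = P @ Sigma                      # covariance of Pi_L X, Theorem 2.24(b)
cov_res = (I - P) @ Sigma                # covariance of X - Pi_L X
cross = (I - P) @ Sigma @ P.T            # zero matrix <=> independence (Lemma 2.22)

# K remains a generalized inverse of both singular covariances, cf. Theorem 2.24(c).
fit_ok = np.allclose(cov_fit @ K @ cov_fit, cov_fit)
res_ok = np.allclose(cov_res @ K @ cov_res, cov_res)
```

The rank of `cov_res` is $n - 1$, matching the degrees of freedom of the $\chi^2$-distribution above.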

There is a natural generalization of the decomposition theorem to multiple
subspaces. A geometric illustration of this decomposition theorem is given in Fig. 2.2.
An orthogonal decomposition of $V$ with respect to an inner product on $V$ is a system
$\Pi_i$, $i = 1, \dots, k$ of $k$ projections onto mutually orthogonal spaces $L_i$, $i = 1, \dots, k$ so
that

$$\Pi_i\Pi_j = 0 \text{ for } i \ne j, \qquad I = \sum_{i=1}^k \Pi_i$$

or, in other words, a decomposition of $V$ as

$$V = L_1 \oplus \dots \oplus L_k$$

into mutually orthogonal subspaces. We then have

Theorem 2.27. Let $X \sim N_V(\xi, \Sigma)$ on $(V, \langle\cdot,\cdot\rangle)$ where $\dim V = d$ and assume
that $\Sigma$ is invertible with concentration $K = \Sigma^{-1}$. Further, let $L_1, \dots, L_k$ denote
an orthogonal decomposition of $V$ with respect to the concentration inner product
$\langle\cdot,\cdot\rangle_K$, and $\Pi_1, \dots, \Pi_k$ the corresponding projections. Let $I$ denote the identity on
$V$. Then the following hold:

(a) The projections $\Pi_i X$, $i = 1, \dots, k$ are mutually independent and normally
distributed;
(b) The covariances of $\Pi_i X$ are $\Pi_i\Sigma$, $i = 1, \dots, k$;
(c) $K$ is a concentration operator for all of $\Pi_i X$, $i = 1, \dots, k$;
(d) If $\Pi_i\xi = 0$, then $\|\Pi_i X\|_K^2 \sim \chi^2(d_i)$, where $d_i = \dim L_i$.

Proof. The proof is completely analogous to the proof of Theorem 2.24 and therefore
omitted.


2.3 The linear normal model

2.3.1 Basic structure

The linear normal model is obtained by assuming $X$ to be normally distributed on
a $d$-dimensional Euclidean vector space $(V, \langle\cdot,\cdot\rangle)$ with mean $\xi$ in a linear subspace
$L \subseteq V$ and covariance $\sigma^2\Sigma$, where $\Sigma \in L(V,V)$ is a fixed and known positive
definite and self-adjoint map. Thus $X$ has density

$$f_{(\xi,\sigma^2)}(x) = \frac{(\det K)^{1/2}}{(2\pi\sigma^2)^{d/2}}\, e^{-\|x - \xi\|_K^2/(2\sigma^2)}$$

with respect to standard Lebesgue measure $\lambda_V$ on $V$; here $K = \Sigma^{-1}$ is the
concentration of the distribution and the squared norm in the exponent is using the
concentration inner product determined by $K$.

This model has representation space $(V, \mathcal{B}(V))$, parameter space $L \times \mathbb{R}_+$, and
the family of probability measures is

$$\mathcal{P} = \{N_V(\xi, \sigma^2\Sigma) \mid (\xi, \sigma^2) \in L \times \mathbb{R}_+\}.$$

We may without loss of generality reparametrize the model by using $\langle\cdot,\cdot\rangle_K$ as
the base inner product on $V$, leading to $\Sigma = K = I$ and the density

$$f_{(\xi,\sigma^2)}(x) = \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\|x - \xi\|^2/(2\sigma^2)}$$

with respect to ‘standard Lebesgue measure on V’, the norm ||-|| now referring to the 
new base inner product (-, -);. We shall later, in Chapter 3, see that the family is also 
smooth and stable so that Bartlett’s identities hold for the score and information; see 
also Section 2.3.2 below. 

Special instances of this model include linear regression, models for comparing 
means, analysis of variance, and many others. And from time to time we shall con- 
sider the submodel where also o? is fixed and known. 


Example 2.28. [Single normal sample] Consider $X = (X_1, \dots, X_n)^\top$ where
$X_1, \dots, X_n$ are independent and identically distributed as $N(\mu, \sigma^2)$. This may be represented as a
linear normal model by considering $X$ as an element of $V = \mathbb{R}^n$ with standard inner
product $\langle u, v\rangle = u^\top v$ and mean $\xi = (\mu, \dots, \mu)^\top$ an element of the one-dimensional
subspace

$$L = \{\xi \in \mathbb{R}^n \mid \xi_1 = \xi_2 = \dots = \xi_n = \mu,\ \mu \in \mathbb{R}\}$$

as in Example 2.26 above. Then the model can either be parametrized by the general
parametrization $(\xi, \sigma^2) \in L \times \mathbb{R}_+$, or by $(\mu, \sigma^2) \in \mathbb{R} \times \mathbb{R}_+$ using the correspondence
$\mu \leftrightarrow \xi = (\mu, \dots, \mu)^\top$.
Example 2.29. [Simple linear regression] Another special instance of the linear
model is that of simple linear regression. We consider $X_1, \dots, X_d$ to be independent
and normally distributed as

$$X_i \sim N(\alpha + \beta t_i, \sigma^2), \quad i = 1, \dots, d,$$

where $t = (t_1, \dots, t_d)$ are known real numbers and $\alpha, \beta \in \mathbb{R}$ and $\sigma^2 \in \mathbb{R}_+$ are
unknown. This is a general linear model with $V = \mathbb{R}^d$, standard inner product, and
mean $\xi \in L$, where $L$ is the two-dimensional subspace of $\mathbb{R}^d$ determined as

$$L = \{\xi \in \mathbb{R}^d \mid \xi_i = \alpha + \beta t_i,\ (\alpha, \beta) \in \mathbb{R}^2\} = A(\mathbb{R}^2).$$

Here $A$ is the design matrix,

$$A = \begin{pmatrix} 1 & t_1 \\ \vdots & \vdots \\ 1 & t_d \end{pmatrix}.$$

We may make a reparametrization by parametrizing $L$ by the intercept $\alpha$ and slope
$\beta$ of the regression line. The parameter space then becomes $\mathbb{R}^2 \times \mathbb{R}_+$. Note that $A$
has full rank 2 unless all $t_i$ are identical.

If $A$ has full rank, the vectors $u_1 = (1, \dots, 1)^\top$ and $u_2 = (t_1, \dots, t_d)^\top$ form a
basis for $L$ or, alternatively, the orthogonal set $v_1, v_2$ where $v_1 = u_1$ and

$$v_2 = (t_1 - \bar{t}, \dots, t_d - \bar{t})^\top$$

with $\bar{t} = (t_1 + \dots + t_d)/d$. It now follows from (A.2) that we have

$$\Pi_L X = \bar{X} v_1 + \frac{\sum_{i=1}^d X_i(t_i - \bar{t})}{\sum_{i=1}^d (t_i - \bar{t})^2}\, v_2,$$

with the $i$th coordinate

$$(\Pi_L X)_i = \bar{X} + (t_i - \bar{t})\, \frac{\sum_{j=1}^d X_j(t_j - \bar{t})}{\sum_{j=1}^d (t_j - \bar{t})^2}.$$

We may also express this in matrix form using Proposition 2.21 as

$$\Pi_L X = A(A^\top A)^{-1}A^\top X = A \begin{pmatrix} d & \sum_i t_i \\ \sum_i t_i & \sum_i t_i^2 \end{pmatrix}^{-1} A^\top X$$

leading to the singular covariance matrix for $\Pi_L X$

$$V(\Pi_L X) = \Pi_L\Sigma = \sigma^2\Pi_L = \sigma^2 A \begin{pmatrix} d & \sum_i t_i \\ \sum_i t_i & \sum_i t_i^2 \end{pmatrix}^{-1} A^\top$$

and similarly for the residual $X - \Pi_L X$.
Note that the $uv$th element $c_{uv}$ of this singular covariance matrix $C$ is given by
the expression

$$c_{uv} = \sigma^2\left\{\frac{(t_u - \bar{t})(t_v - \bar{t})}{\sum_{i=1}^d (t_i - \bar{t})^2} + \frac{1}{d}\right\}$$

and despite this apparently complicated expression, it is still true that $K = \sigma^{-2} I_d$
is valid as a concentration matrix for $\Pi_L X$, see Remark 2.25. Indeed, we have the
alternative expression for $C$ as

$$C = \sigma^2\left(\frac{v_1 v_1^\top}{\|v_1\|^2} + \frac{v_2 v_2^\top}{\|v_2\|^2}\right),$$

where $v_1$ and $v_2$ are the orthogonal basis for $L$ as found above.
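The two descriptions of the projection in this example, via the design matrix and via the orthogonal basis $v_1, v_2$, can be compared numerically. The following sketch assumes hypothetical covariate values `t` and variance `sigma2`, chosen only for illustration.

```python
import numpy as np

t = np.array([0.0, 1.0, 2.0, 4.0])       # hypothetical covariate values
sigma2 = 1.5                             # hypothetical variance
d = len(t)

A = np.column_stack([np.ones(d), t])     # design matrix with rows (1, t_i)
H = A @ np.linalg.solve(A.T @ A, A.T)    # matrix of Pi_L = A (A^T A)^{-1} A^T
C = sigma2 * H                           # covariance matrix of Pi_L X

# Same matrix from the orthogonal basis v_1 = (1,...,1), v_2 = t - mean(t):
tc = t - t.mean()
C_alt = sigma2 * (1.0 / d + np.outer(tc, tc) / np.sum(tc**2))

K = np.eye(d) / sigma2                   # concentration of X
still_valid = np.allclose(C @ K @ C, C)  # K is a generalized inverse of C
```

Using `np.linalg.solve` rather than forming $(A^\top A)^{-1}$ explicitly is a common numerical choice; for larger or ill-conditioned designs a QR decomposition would typically be preferred.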


In a linear model we are typically interested in the mean $\xi \in L$, whereas the
variance $\sigma^2$ appears as a nuisance parameter; we need to take $\sigma^2$ into account as it
determines the precision with which the mean may be determined. But other times
our focus could be on the variance rather than the mean, as in the example below.


Example 2.30. [Double measurements] Suppose we are interested in determining
the precision of a measuring instrument. We may then collect $n$ units and measure
each unit twice, resulting in $2n$ independent observations

$$X_i, Y_i \sim N(\xi_i, \sigma^2), \quad i = 1, \dots, n$$

where $\xi_i \in \mathbb{R}$ is characteristic for the $i$th unit and $\sigma^2$ is the precision of the
instrument. This is a linear normal model on $V = \mathbb{R}^n \times \mathbb{R}^n$ with the usual inner product
and mean $\xi \in L$, where

$$L = \{(u, v) \in \mathbb{R}^n \times \mathbb{R}^n \mid u_i = v_i,\ i = 1, \dots, n\}.$$

Here, the parameter of interest is $\phi(\xi, \sigma^2) = \sigma^2$, whereas $\xi = (\xi_1, \dots, \xi_n)$ is only a
necessary disturbance, and thus a nuisance parameter.

2.3.2 Likelihood, score, and information


Consider now a linear normal model given as $X \sim N_V(\xi, \sigma^2\Sigma)$ with $\Sigma$ known and
$\xi \in L$, and let us for simplicity assume that also $\sigma^2$ is known. To calculate the
log-likelihood, score, and information, we shall assume that $V = \mathbb{R}^d$ with standard inner
product and the linear space $L$ is determined by a design matrix $A \in \mathbb{R}^{d \times k}$ where $A$
has full rank $k < d$ as in Example 2.29 so that $\xi = A\theta$ with $\theta \in \Theta = \mathbb{R}^k$, i.e. the
linear subspace is simply $L = A(\mathbb{R}^k)$ and the parameter space is $\Theta$.

We get for the log-likelihood function (ignoring constant terms)

$$\ell_x(\theta) = -\frac{\|x - A\theta\|_K^2}{2\sigma^2} = -\frac{(x - A\theta)^\top K(x - A\theta)}{2\sigma^2} = -\frac{x^\top Kx}{2\sigma^2} + \frac{x^\top KA\theta}{\sigma^2} - \frac{\theta^\top A^\top KA\theta}{2\sigma^2}$$

and differentiate to find the score function

$$S(x, \theta) = \frac{x^\top KA}{\sigma^2} - \frac{\theta^\top A^\top KA}{\sigma^2} = \frac{(x - A\theta)^\top KA}{\sigma^2} = \frac{1}{\sigma^2}\langle x - A\theta, A\rangle_K. \tag{2.9}$$

Further, when we change sign and differentiate further, the information function
becomes

$$I(x, \theta) = \frac{1}{\sigma^2}A^\top KA = i(\theta) \tag{2.10}$$

which is constant in $x$ and therefore equal to the Fisher information. In this case,
the Fisher information is also constant in $\theta$. Also, note that the Fisher information is
inversely proportional to the variance factor $\sigma^2$.

As mentioned, we shall show in Chapter 3 that linear normal models are regular
exponential families and therefore smooth and stable, so Bartlett’s identities hold.
But here it can also be seen directly: inspecting the expression (2.9) for the score
yields

$$E\{S(X, \theta)\} = E\left\{\frac{(X - A\theta)^\top KA}{\sigma^2}\right\} = 0$$

and

$$E\{S(X, \theta)^\top S(X, \theta)\} = \frac{1}{\sigma^4}E\{A^\top K(X - A\theta)(X - A\theta)^\top KA\} = \frac{1}{\sigma^4}A^\top K(\sigma^2\Sigma)KA = \frac{1}{\sigma^2}A^\top KA = i(\theta).$$

The quadratic score becomes

$$\begin{aligned}
Q(X, \theta) &= S(X, \theta)\, i(\theta)^{-1} S(X, \theta)^\top \\
&= (X - A\theta)^\top KA(A^\top KA)^{-1}A^\top K(X - A\theta)/\sigma^2 \\
&= (X - A\theta)^\top KH(X - A\theta)/\sigma^2 \\
&= \langle X - A\theta, H(X - A\theta)\rangle_K/\sigma^2 = \|\Pi_L(X - A\theta)\|_{K/\sigma^2}^2
\end{aligned}$$

since the somewhat lengthy expression

$$H = A(A^\top KA)^{-1}A^\top K = A(A^\top(K/\sigma^2)A)^{-1}A^\top(K/\sigma^2)$$

is the matrix for the orthogonal projection onto $L$ with respect to $\langle\cdot,\cdot\rangle_{K/\sigma^2}$, also
known as the hat-matrix; see Proposition 2.21.

As we have $E(X - A\theta) = 0$, Theorem 2.27 implies that the quadratic score
$Q(X, \theta)$ is exactly $\chi^2(k)$ distributed in the linear model with fixed covariance. This
should be compared to Theorem 1.25 and Corollary 1.35.
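The expressions for the score, information, and quadratic score translate directly into matrix code. The sketch below is a minimal illustration under the assumptions of this section ($V = \mathbb{R}^d$, known $K$ and $\sigma^2$); the function names and all numerical values are hypothetical.

```python
import numpy as np

def score(x, theta, A, K, sigma2):
    """Score (2.9), returned as a vector of length k."""
    return (x - A @ theta) @ K @ A / sigma2

def information(A, K, sigma2):
    """Fisher information (2.10)."""
    return A.T @ K @ A / sigma2

def quadratic_score(x, theta, A, K, sigma2):
    """Q = S i^{-1} S^T, computed with a linear solve instead of an inverse."""
    s = score(x, theta, A, K, sigma2)
    return s @ np.linalg.solve(information(A, K, sigma2), s)

# Illustrative data (assumptions, not from the text).
t = np.array([0.0, 1.0, 2.0, 4.0])
A = np.column_stack([np.ones(len(t)), t])
K = np.diag([1.0, 2.0, 1.0, 2.0])
sigma2 = 2.0
theta = np.array([1.0, 0.5])
x = np.array([1.2, 0.7, 2.9, 2.5])

Q = quadratic_score(x, theta, A, K, sigma2)

# Same quantity via the hat matrix H = A (A^T K A)^{-1} A^T K.
H = A @ np.linalg.solve(A.T @ K @ A, A.T @ K)
r = H @ (x - A @ theta)
Q_proj = r @ K @ r / sigma2

# The score vanishes at the solution of the normal equations.
theta_hat = np.linalg.solve(A.T @ K @ A, A.T @ K @ x)
```

The agreement of `Q` and `Q_proj` is the numerical counterpart of the identity $Q(X,\theta) = \|\Pi_L(X - A\theta)\|^2_{K/\sigma^2}$.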


2.4 Exercises 


Exercise 2.1. Let $X \sim N_3(\xi, \Sigma)$ where

3 0 

8 0 

0 1 

a) Find the marginal distributions of $(X_1, X_2)^\top$ and $(X_1, X_3)^\top$.
b) Find the distribution of $(X_2, X_3, X_1)^\top$.
c) Find the distribution of $Y = X_1 - X_2 + X_3$.
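Exercise 2.2. Let $X \sim N_3(\xi, \Sigma)$ where

$$\xi = \begin{pmatrix} 1 \\ -1 \\ 2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} 4 & 3 & 1 \\ 3 & 8 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$

a) Argue that $X$ has a density with respect to Lebesgue measure on $\mathbb{R}^3$.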
b) Find the distribution of 
Y= X, — Xo+2X3 
~ Dy Ne 


c) Find the concentration matrix for the distribution of Y. 
Exercise 2.3. Let X ~ No2(€,%) where 


(=O) 


and define Y € R? by Yj = Xy — X, and Yo = X, —- 1. 
a) Argue that X has a density with respect to Lebesgue measure on R?. 


b) Find the concentration matrix $K$ of $X$.
c) Write an expression for the density, and sketch the level curves.
d) What is the distribution of $Y$?
e) Are $Y_1$ and $Y_2$ independent?
f) Give an expression for the projection onto $L = \{x \in \mathbb{R}^2 : x_1 = 2x_2\}$ with
respect to the concentration inner product $\langle\cdot,\cdot\rangle_K$.


Exercise 2.4. Let $X = (X_1, X_2)^\top$ follow a normal distribution on $\mathbb{R}^2$ with
expectation 0 and covariance matrix

$$\Sigma(a) = \begin{pmatrix} a & a/2 \\ a/2 & a \end{pmatrix},$$

where $a \in \mathbb{R}$.

a) For which values of $a$ is $\Sigma(a)$ a valid covariance matrix?
b) For which values of $a$ is the distribution regular?
c) Find the distribution of $Y = (Y_1, Y_2)^\top = (X_1 + X_2, X_1 - X_2)^\top$ and show that
$Y_1$ and $Y_2$ are independent.
d) Find the distribution of $Z = Y_1^2/(3Y_2^2)$.

Exercise 2.5. Let $X \sim N_3(0, I_3)$ and define $Y = a + BX$, where

$$a = \begin{pmatrix} 1 \\ 0 \\ -1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & -2 & 1 \\ 1 & 1 & 1 \\ 2 & -1 & 2 \end{pmatrix}.$$

a) What is the distribution of $Y$?
b) Which of the pairs $(Y_1, Y_2)$, $(Y_1, Y_3)$, and $(Y_2, Y_3)$ are independent?
c) Show that the distribution of $Y$ is singular and identify the support of the
distribution.


Exercise 2.6. Let $X \sim N_2(0, \Sigma)$ where

$$\Sigma = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}.$$

a) Show that the distribution of $X$ is singular.
b) Identify the support of the distribution.
c) Find all valid concentration operators $K$ for $X$.


Exercise 2.7. Let $X = (X_1, X_2, X_3)^\top$ follow a normal distribution on $\mathbb{R}^3$ with
expectation 0 and covariance matrix


where $a \in \mathbb{R}$.

a) For which values of $a$ is $\Sigma(a)$ a valid covariance matrix?

b) What is the distribution of Y = 2X, — X3? 

c) Show that the distribution of $X$ is singular if $a = 2(\sqrt{2} - 1)$.

d) Find the support of the distribution of $X$ for the case $a = 2(\sqrt{2} - 1)$.

e) Determine the set of valid concentration operators for $\Sigma(a)$ when $a = 2(\sqrt{2} - 1)$.
Exercise 2.8. Let $X \sim N_3(\xi, \sigma^2 I_3)$ where


a+b 


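Further, let $L$ be the linear subspace of $\mathbb{R}^3$ determined as

$$L = \operatorname{span}\left\{\begin{pmatrix} 1 \\ 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}\right\}.$$

a) Determine $\Pi_L X$ and find its distribution.

b) Find the distribution of $\|X - \Pi_L X\|^2$, where $\|\cdot\|$ denotes the usual Euclidean
norm on $\mathbb{R}^3$.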


Exercise 2.9. Let $X \sim N_3(0, \Sigma_\rho)$ where $-1 < \rho < 1$ and


a) Show that the specification above defines a regular normal distribution on $\mathbb{R}^3$.

b) Show that the concentration matrix $K_\rho$ is given as

$$K_\rho = \frac{1}{1 - \rho^2}\begin{pmatrix} 1 & -\rho & 0 \\ -\rho & 1 + \rho^2 & -\rho \\ 0 & -\rho & 1 \end{pmatrix}.$$

c) Let now $L$ be the subspace spanned by the constant vector $e = (1, 1, 1)^\top$.
Determine the projection $\Pi_L X$ onto $L$ with respect to the inner product $\langle\cdot,\cdot\rangle_{K_\rho}$.

d) Determine the distribution of $\Pi_L X$ and $X - \Pi_L X$.


Chapter 3 


Exponential Families 


3.1 Regular exponential families 


There is a large and important class of statistical models that have a common
mathematical structure which makes their analysis particularly simple. These are models
where the associated family $\mathcal{P}$ of probability measures is a so-called exponential
family. Such models are also known as exponential models. We define:


Definition 3.1. An exponential family of probability distributions on $(\mathcal{X}, \mathbb{E})$ is a
parametrized family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ of the form

$$P_\theta(A) = \int_A \frac{e^{\langle\theta, t(x)\rangle}}{c(\theta)}\, d\mu(x), \tag{3.1}$$

where $V$ is a $k$-dimensional Euclidean vector space with inner product $\langle\cdot,\cdot\rangle$, $\Theta \subseteq V$
is the space of canonical parameters, $t : \mathcal{X} \to V$ is the canonical statistic, and $\mu$ is
a $\sigma$-finite base measure of the family.

Remark 3.2. We note in particular that an exponential family of measures is
dominated by the base measure $\mu$ and the form of the density implies that the support of
the measures $P_\theta$ is the same for all $\theta \in \Theta$.


We shall in the following assume that the exponential family $\mathcal{P}$ is minimally
represented and regular which means that

minimal  There is no $(\lambda, c) \in V \setminus \{0\} \times \mathbb{R}$ with $\langle\lambda, t(x)\rangle = c$ a.e. $\mu$.
regular  The parameter space $\Theta$ is an open and convex subset of $V$.

Then $k = \dim V$ is the dimension of the exponential family. Since $P_\theta(\mathcal{X}) = 1$, we
must have for all $\theta \in \Theta$:

$$c(\theta) = \int_{\mathcal{X}} e^{\langle\theta, t(x)\rangle}\, d\mu(x) < \infty \tag{3.2}$$

so $c(\theta)$ is a finite normalizing constant. The family is said to be full if it is also as
large as possible:

$$\Theta = \left\{\theta \in V \,\Big|\, c(\theta) = \int_{\mathcal{X}} e^{\langle\theta, t(x)\rangle}\, d\mu(x) < \infty\right\}.$$

In much standard literature, the term ‘regular’ incorporates that the family is also
full. However, in this text we shall not be concerned with fullness. It is practical to
introduce the function

$$\psi(\theta) = \log c(\theta)$$

which is known as the cumulant function of the exponential family and we may then
alternatively write the density in (3.1) as

$$f_\theta(x) = e^{\langle\theta, t(x)\rangle - \psi(\theta)} \tag{3.3}$$

and the associated log-likelihood function as

$$\ell_x(\theta) = \langle\theta, t(x)\rangle - \psi(\theta).$$

In the following we often assume that an orthonormal basis for $V$ has been chosen
so that we without loss of generality can identify $V$ with $\mathbb{R}^k$ and write the inner
product in terms of coordinates as

$$\langle\theta, t(x)\rangle = \theta^\top t(x).$$


3.2 Examples of exponential families 


Many of the statistical models we have seen in the previous chapters are indeed 
exponential families. In each case, however, a little reformulation is needed to see 
that this is the case. 


Example 3.3. [Bernoulli model as exponential model] Consider the Bernoulli model
in Example 1.4 with densities for $\mu \in (0, 1)$ given as

$$f_\mu(x) = \mu^x(1 - \mu)^{1-x}, \quad x \in \{0, 1\}$$

with respect to counting measure on $\mathcal{X} = \{0, 1\}$. To identify this as an exponential
family, we introduce the parameter

$$\theta = \log\frac{\mu}{1 - \mu}$$

representing the log-odds of the distribution and note that we then have

$$\mu = \frac{e^\theta}{1 + e^\theta}, \qquad 1 - \mu = \frac{1}{1 + e^\theta}$$

so we may rewrite the density in terms of this parameter as

$$f_\theta(x) = \frac{e^{\theta x}}{1 + e^\theta}, \quad x \in \{0, 1\}.$$

When the parameter $\mu$ varies in the unit interval, the log-odds $\theta$ vary in all of $\mathbb{R}$. Thus
we may identify the Bernoulli family as an exponential family with the following
characteristics:

base measure  counting measure on $\{0, 1\}$;
canonical parameter  $\theta = \log(\mu/(1 - \mu)) \in \Theta = \mathbb{R}$;
canonical statistic  $t(x) = x$;
normalizing constant  $c(\theta) = 1 + e^\theta$;
cumulant function  $\psi(\theta) = \log c(\theta) = \log(1 + e^\theta)$.

The family is regular since $\Theta = \mathbb{R}$ is an open subset of $\mathbb{R}$ and if $\lambda \cdot 1 = \lambda \cdot 0$, we
must have $\lambda = 0$ so it is also minimally represented and the dimension of the family
is one.

Note that we could also have represented the Bernoulli model with the canonical
parameter $\theta = (\theta_1, \theta_2)^\top \in \mathbb{R}^2$ with

$$\theta_1 = \log\mu, \qquad \theta_2 = \log(1 - \mu)$$

and

$$t(x) = (t_1(x), t_2(x))^\top = (x, 1 - x)^\top$$

since then

$$f_\theta(x) = \mu^x(1 - \mu)^{1-x} = \exp\{\theta^\top t(x)\}, \quad x \in \{0, 1\},$$

making the family appear to be two-dimensional. However, in this form, the family
is over-parametrized and the representation is not minimal since $t_1(x) + t_2(x) = 1$
for all $x \in \{0, 1\}$.
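The reparametrization in the Bernoulli example can be checked with a few lines of code. The sketch below uses only the standard library; the function names `log_odds`, `cumulant`, and `density` are hypothetical, and the value of the success probability is an arbitrary illustration.

```python
import math

def log_odds(mu):
    """Canonical parameter theta = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def cumulant(theta):
    """Cumulant function psi(theta) = log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))

def density(x, theta):
    """Exponential-family form exp(theta * t(x) - psi(theta)) with t(x) = x."""
    return math.exp(theta * x - cumulant(theta))

mu = 0.3                                  # illustrative success probability
theta = log_odds(mu)
p1, p0 = density(1, theta), density(0, theta)
```

Here `p1` and `p0` recover $\mu$ and $1 - \mu$ up to rounding, confirming that the two forms of the density agree.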
Example 3.4. [Poisson model as an exponential model] The simple Poisson family
in Example 1.2 is an exponential family. To see this we reparametrize to $\theta = \log\lambda$
where $\lambda$ is the mean of the distribution and write

$$f_\theta(x) = \frac{\lambda^x}{x!}e^{-\lambda} = \frac{e^{\theta x}}{e^{e^\theta}}\,\frac{1}{x!},$$

identifying the Poisson model as a one-dimensional exponential model with

base measure  $\mu = \frac{1}{x!} \cdot m$, where $m$ is counting measure on $\mathbb{N}_0$;
canonical parameter  $\theta \in \Theta = \mathbb{R}$;
canonical statistic  $t(x) = x$;
normalizing constant  $c(\theta) = e^{e^\theta}$;
cumulant function  $\psi(\theta) = \log c(\theta) = e^\theta$.

The model is minimally represented as $\lambda x$ is constant if and only if $\lambda = 0$ and it is
regular since $\Theta = \mathbb{R}$, the range of $\theta = \log\lambda$ for $\lambda \in \mathbb{R}_+$, is an open and convex subset of $\mathbb{R}$.

Note that if we consider the variant with $X_1, \dots, X_n$ independent and identically
Poisson distributed, we have the density

$$f_\theta(x_1, \dots, x_n) = \prod_{i=1}^n \frac{e^{\theta x_i}}{e^{e^\theta} x_i!} = \frac{e^{\theta\sum_i x_i}}{e^{n e^\theta}} \prod_{i=1}^n \frac{1}{x_i!},$$

so this is again an exponential family with the same parameter space, but base
measure $\mu^{\otimes n}$, canonical statistic $t(x_1, \dots, x_n) = \sum_i x_i$, and cumulant function
$\psi_n(\theta) = n\psi(\theta)$. As we shall see below in Theorem 3.11, this is a general
phenomenon for exponential families and a major reason for their importance.
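Both the single-observation and the $n$-observation forms of the Poisson density can be verified numerically. The sketch below uses the standard library only; the function names and the sample values are hypothetical.

```python
import math

def poisson_density(x, theta):
    """exp(theta*x - psi(theta)) times the base-measure weight 1/x!, psi(theta) = exp(theta)."""
    return math.exp(theta * x - math.exp(theta)) / math.factorial(x)

def joint_density(xs, theta):
    """Density of n i.i.d. observations: statistic sum(xs), cumulant n * exp(theta)."""
    n = len(xs)
    base = 1.0
    for x in xs:
        base /= math.factorial(x)          # product of the base-measure weights
    return math.exp(theta * sum(xs) - n * math.exp(theta)) * base

lam = 2.5                                  # illustrative mean
theta = math.log(lam)                      # canonical parameter
xs = [3, 0, 2]                             # illustrative sample

single = poisson_density(3, theta)
joint = joint_density(xs, theta)
product = math.prod(poisson_density(x, theta) for x in xs)
```

The joint density depends on the data only through $\sum_i x_i$ (apart from the base measure), which is the sense in which the canonical statistic of the sample is the sum.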


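Example 3.5. The simple normal model in Example 1.5 is a regular exponential
model. To see this, we first rewrite the density for a single observation as

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\xi)^2}{2\sigma^2}} = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{x^2}{2\sigma^2} + \frac{\xi x}{\sigma^2} - \frac{\xi^2}{2\sigma^2}} = \frac{e^{-\frac{\xi^2}{2\sigma^2}}}{\sqrt{2\pi\sigma^2}}\, e^{\frac{\xi x}{\sigma^2} - \frac{x^2}{2\sigma^2}}.$$

We next reparametrize and introduce the parameter $\theta = (\theta_1, \theta_2)^\top$ given by

$$\theta_1 = \frac{\xi}{\sigma^2}, \quad \theta_2 = \frac{1}{\sigma^2}, \qquad \text{with } \xi = \theta_1/\theta_2, \quad \sigma^2 = 1/\theta_2,$$

and let

$$c(\theta) = e^{\frac{\xi^2}{2\sigma^2}} \sqrt{2\pi\sigma^2} = e^{\frac{\theta_1^2}{2\theta_2}} \sqrt{\frac{2\pi}{\theta_2}}.$$

With these quantities the density may be rewritten as

$$f_\theta(x) = \frac{e^{\theta_2(-x^2/2) + \theta_1 x}}{c(\theta)}.$$

When $(\xi, \sigma^2)$ varies in $\Omega = \mathbb{R} \times \mathbb{R}_+$, $\theta$ varies in all of $\Theta = \mathbb{R} \times \mathbb{R}_+$, which is an open and
convex set. We thus have a two-dimensional regular exponential family with

base measure standard Lebesgue measure on $\mathbb{R}$;

canonical parameter $\theta = (\theta_1, \theta_2)^\top = (\xi\sigma^{-2}, \sigma^{-2})^\top \in \Theta$;

canonical statistic $t(x) = (x, -x^2/2)^\top$;

normalizing constant $c(\theta) = \exp(\theta_1^2/(2\theta_2)) \sqrt{2\pi/\theta_2}$;

cumulant function $\psi(\theta) = \log c(\theta) = \theta_1^2/(2\theta_2) + \frac{1}{2}\log(2\pi) - \frac{1}{2}\log\theta_2$.

The family is minimally represented, for if we assume that $\lambda^\top t(X)$ is almost
everywhere constant, we have

$$\lambda_1 x - \frac{1}{2}\lambda_2 x^2 + c = 0 \quad \text{almost everywhere},$$

but the left-hand side is a polynomial of degree at most two, so it must be identically
zero, which forces $\lambda = 0$.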

As seen from the above, identifying the simple normal model as an exponential
model is a bit involved and may not in itself be so helpful; but it will be important
when considering related models. We refrain from giving the details here of the
extension to the case of $n$ observations.
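
The closed form of the normalizing constant in the simple normal model, $c(\theta) = \exp(\theta_1^2/(2\theta_2))\sqrt{2\pi/\theta_2}$, can be checked against a direct numerical integration of $e^{\theta_1 x - \theta_2 x^2/2}$. The sketch below uses a plain midpoint rule; the parameter values, the integration window, and the function names are assumptions made for illustration.

```python
import math

def normalizing_constant(theta1, theta2):
    """Closed form c(theta) = exp(theta1^2 / (2 theta2)) * sqrt(2 pi / theta2)."""
    return math.exp(theta1**2 / (2 * theta2)) * math.sqrt(2 * math.pi / theta2)

def integrate_kernel(theta1, theta2, half_width=12.0, steps=20_000):
    """Midpoint-rule integral of exp(theta1*x - theta2*x^2/2) around xi = theta1/theta2."""
    xi = theta1 / theta2
    sigma = 1 / math.sqrt(theta2)
    lower = xi - half_width * sigma
    h = 2 * half_width * sigma / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * h
        total += math.exp(theta1 * x - theta2 * x * x / 2)
    return total * h

theta1, theta2 = 0.75, 0.5  # hypothetical values: xi = 1.5, sigma^2 = 2
print(normalizing_constant(theta1, theta2), integrate_kernel(theta1, theta2))
```

The integration window of twelve standard deviations on either side of the mean is an arbitrary but generous choice; the mass outside it is negligible at double precision.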


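Example 3.6. [Linear normal model] Consider the linear normal model as discussed
in Section 2.3. This corresponds to the family of densities

$$f_{\xi,\sigma^2}(x) = \frac{1}{(2\pi\sigma^2)^{d/2}}\, e^{-\frac{\|x - \xi\|^2}{2\sigma^2}}$$

with respect to standard Lebesgue measure $\lambda_V$ on $(V, \langle\cdot,\cdot\rangle)$, where $d = \dim V$ and
$(\xi, \sigma^2) \in L \times \mathbb{R}_+$ for $L$ being a linear subspace of $V$.
To see this is an exponential family, we rewrite the quadratic term in the exponent:

$$-\frac{\|x - \xi\|^2}{2\sigma^2} = -\frac{\|x\|^2}{2\sigma^2} + \frac{\langle \xi, x\rangle}{\sigma^2} - \frac{\|\xi\|^2}{2\sigma^2} = \left\langle \frac{\xi}{\sigma^2}, \Pi_L(x) \right\rangle - \frac{\|x\|^2}{2\sigma^2} - \frac{\|\xi\|^2}{2\sigma^2},$$

where $\Pi_L$ is the orthogonal projection onto $L$, and we have used that for any $\xi \in L$
it holds that $\langle \xi, x\rangle = \langle \xi, \Pi_L(x)\rangle$. Letting now $\theta_1 = \xi/\sigma^2$, $\theta_2 = 1/\sigma^2$,
$t(x) = (\Pi_L(x), -\|x\|^2/2)$, and introducing the inner product $\langle\cdot,\cdot\rangle_{\mathrm{rep}}$ as

$$\langle \theta, t(x)\rangle_{\mathrm{rep}} = \langle \theta_1, t_1(x)\rangle + \theta_2 t_2(x),$$

we can write the density as

$$f_\theta(x) = \exp\{\langle \theta, t(x)\rangle_{\mathrm{rep}} - \psi(\theta)\}$$

with the cumulant function being given as

$$\psi(\theta) = \frac{d}{2}\log(2\pi) - \frac{d}{2}\log\theta_2 + \frac{\|\theta_1\|^2}{2\theta_2}. \qquad (3.4)$$

This is a direct generalization of what we found in Example 3.5. Thus we have
represented the general linear model as an exponential family with base measure
$\lambda_V$, canonical parameter space $\Theta = L \times \mathbb{R}_+ \subset L \times \mathbb{R}$, and canonical statistic
$t(x) = (\Pi_L(x), -\|x\|^2/2)$.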

For the sake of completeness, we conclude this section by identifying some fam- 
ilies that are not exponential families in the sense we have defined here. 


Example 3.7. [Models that are not exponential] We first consider the uniform model
in Example 1.8. This is not an exponential family since the support of the uniform
density is $[0, \theta]$ and thus depends on the unknown parameter $\theta$, contradicting
Remark 3.2.

Further, also the Cauchy model is not an exponential model. Although this is not
so easy to show formally, there is no way that the family of densities

$$f_{\alpha,\beta}(x) = \frac{\beta}{\pi\{(x - \alpha)^2 + \beta^2\}}, \qquad x \in \mathbb{R},$$

for $(\alpha, \beta) \in \mathbb{R} \times \mathbb{R}_+$ can be represented in the required form.

Finally, certain subfamilies of regular exponential families are not necessarily
regular exponential families themselves. Consider, for example, the subfamily of the
normal family determined by the relation $\xi = \sigma$, i.e. that the standard deviation is
equal to the mean. In terms of the canonical parameters $(\theta_1, \theta_2)$, this is the submodel
given by $\theta_2 = \theta_1^2$ and the problem is that the set

$$\Theta_0 = \{\theta \in \Theta \mid \theta_2 = \theta_1^2\}$$

is not an open subset of $\Theta$. We shall later, in Section 3.6, look at more general
families, known as curved exponential families, and show that such families share
many, but not all, properties with those that are regular. The model determined by
$\Theta_0$ is an example of a curved exponential family.


3.3. Properties of exponential families 


Exponential families have many nice properties, including the following important 
collection of results: 


Theorem 3.8. Let $X$ be a random variable and assume that $X$ follows a distribution
from a $k$-dimensional regular and minimally represented exponential family with
representation space $(\mathcal{X}, \mathbb{E})$, canonical parameter space $\Theta \subseteq \mathbb{R}^k$, canonical statistic
$t(X)$, and base measure $\mu$. Then $t(X)$ has moments of any order.


Proof. For $\theta \in \Theta$ and $\theta + h \in \Theta$ we have for all $n \in \mathbb{N}$ that

$$E_\theta\{|h^\top t(X)|^n\} < \infty$$

by Lemma B.1. Choosing $h = (h_1, \dots, h_k)$ so that $h_i = \varepsilon$ and $h_j = 0$ for $j \neq i$
yields that for $\varepsilon$ sufficiently small it holds for all $n \in \mathbb{N}$ that

$$E_\theta\{\varepsilon^n |t_i(X)|^n\} < \infty$$

and hence all moments of $t(X)$ are finite.


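Theorem 3.9. In the same situation as specified above, the normalizing constant
$c(\theta)$ is a smooth function of $\theta$, and we have

$$\frac{\partial^{m_1 + \cdots + m_k}}{\partial\theta_1^{m_1} \cdots \partial\theta_k^{m_k}}\, c(\theta) = \int_{\mathcal{X}} \prod_{j=1}^k t_j(x)^{m_j}\, e^{\theta^\top t(x)}\, d\mu(x) = c(\theta)\, E_\theta\Big\{\prod_{j=1}^k t_j(X)^{m_j}\Big\}$$

for all $m_1, \dots, m_k \in \mathbb{N}_0$.


Proof. The relation follows from Lemma B.2 since this allows us to switch the order of
differentiation and integration (Schilling, 2017, Theorem 11.5):

$$\frac{\partial^{m_1 + \cdots + m_k}}{\partial\theta_1^{m_1} \cdots \partial\theta_k^{m_k}}\, c(\theta) = \frac{\partial^{m_1 + \cdots + m_k}}{\partial\theta_1^{m_1} \cdots \partial\theta_k^{m_k}} \int_{\mathcal{X}} e^{\theta^\top t(x)}\, d\mu(x)$$

$$= \int_{\mathcal{X}} \frac{\partial^{m_1 + \cdots + m_k}}{\partial\theta_1^{m_1} \cdots \partial\theta_k^{m_k}}\, e^{\theta^\top t(x)}\, d\mu(x)$$

$$= \int_{\mathcal{X}} \prod_{j=1}^k t_j(x)^{m_j}\, e^{\theta^\top t(x)}\, d\mu(x)$$

$$= c(\theta)\, E_\theta\Big\{\prod_{j=1}^k t_j(X)^{m_j}\Big\},$$

as desired.

And, importantly, an exponential family satisfies the regularity conditions needed
for score and information to be well-defined.


Theorem 3.10. A regular and minimally represented exponential family is smooth
and stable.


Proof. Since an exponential function is smooth and the normalization constant $c(\theta)$
is smooth by Theorem 3.9, we just have to establish stability. For such an exponential
family, we get

$$\frac{\partial}{\partial\eta_i} f_\eta(x) = \left(t_i(x) - \frac{c_i(\eta)}{c(\eta)}\right) \frac{e^{\eta^\top t(x)}}{c(\eta)} \qquad (3.5)$$

where $c_i(\eta) = \partial c(\eta)/\partial\eta_i$, and further

$$\frac{\partial^2}{\partial\eta_i \partial\eta_j} f_\eta(x) = \left(t_i(x) - \frac{c_i(\eta)}{c(\eta)}\right)\left(t_j(x) - \frac{c_j(\eta)}{c(\eta)}\right) \frac{e^{\eta^\top t(x)}}{c(\eta)} + \left(\frac{c_i(\eta) c_j(\eta)}{c(\eta)^2} - \frac{c_{ij}(\eta)}{c(\eta)}\right) \frac{e^{\eta^\top t(x)}}{c(\eta)} \qquad (3.6)$$

where $c_{ij}(\eta) = \partial^2 c(\eta)/\partial\eta_i \partial\eta_j$. For any $m = (m_1, \dots, m_k)$ with $m_j \in \mathbb{N}_0$,
Lemma B.2 yields that

$$\Big|\prod_{i=1}^k t_i(x)^{m_i}\, e^{\eta^\top t(x)}\Big| \le h_\theta(x)$$

for all $\eta$ in an open neighbourhood $U_\theta$ around $\theta$ where $h_\theta$ is $\mu$-integrable. Also $c(\eta)$,
$c(\eta)^{-1}$, $|c_i(\eta)|$, $|c_j(\eta)|$, and $|c_{ij}(\eta)|$ are all bounded above in such a neighbourhood.
Thus, the derivatives in (3.5) and (3.6) are locally bounded by integrable functions,
as desired.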
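
Theorem 3.9 can be illustrated numerically in the Poisson family of Example 3.4, where $c(\theta) = e^{e^\theta}$ and the base measure has density $1/x!$ with respect to counting measure, so that the integral in the theorem becomes the series $\sum_x x^m e^{\theta x}/x!$. The sketch below compares finite-difference derivatives of $c$ with a truncated version of this series. The truncation point, step size, parameter value, and function names are assumptions chosen for illustration.

```python
import math

def c(theta):
    """Normalizing constant of the Poisson family, c(theta) = exp(exp(theta))."""
    return math.exp(math.exp(theta))

def moment_integral(theta, m, terms=200):
    """Truncated sum over x of x^m * exp(theta*x) / x!, the integral in Theorem 3.9."""
    return sum(x**m * math.exp(theta * x - math.lgamma(x + 1)) for x in range(terms))

def derivative(f, theta, order, h=1e-3):
    """Central finite-difference approximation to the first or second derivative."""
    if order == 1:
        return (f(theta + h) - f(theta - h)) / (2 * h)
    return (f(theta + h) - 2 * f(theta) + f(theta - h)) / h**2

theta = 0.3  # hypothetical parameter value
for m in (1, 2):
    print(m, derivative(c, theta, m), moment_integral(theta, m))
```

For each order the two printed numbers agree to several digits, the remaining difference being the finite-difference error.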


3.4 Constructing exponential families 


There are numerous ways of constructing new exponential families from other expo- 
nential families. We mention a few of the simplest here. 


3.4.1 Product families 


Consider an exponential family, $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$, having representation space
$(\mathcal{X}, \mathbb{E})$, base measure $\mu$, and canonical statistic $s : \mathcal{X} \to \mathbb{R}^k$, and another exponential
family, $\mathcal{Q} = \{Q_\xi \mid \xi \in \Xi\}$, with representation space $(\mathcal{Y}, \mathbb{F})$, base measure $\nu$, and
canonical statistic $t : \mathcal{Y} \to \mathbb{R}^m$. We may then form the outer product family

$$\mathcal{P} \otimes \mathcal{Q} = \{P_\theta \otimes Q_\xi \mid (\theta, \xi) \in \Theta \times \Xi\}.$$

This again becomes an exponential family, and if each of the original families is
regular and minimal, so is the product family. Indeed, the representation space is
$(\mathcal{X} \times \mathcal{Y}, \mathbb{E} \otimes \mathbb{F})$, the base measure is $\mu \otimes \nu$, the canonical parameter space is
$\Theta \times \Xi$, and the canonical statistic, taking values in $(\mathbb{R}^{k+m}, \mathcal{B}(\mathbb{R}^{k+m}))$, is

$$u : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^{k+m}, \qquad u(x, y) = (s(x), t(y)),$$

so the dimension of the model is $k + m$. The corresponding family of densities with
respect to $\mu \otimes \nu$ may now be written as

$$f_{\theta,\xi}(x, y) = \frac{1}{c(\theta)\, d(\xi)}\, e^{\theta^\top s(x) + \xi^\top t(y)},$$

where $c(\theta)$ and $d(\xi)$ are the normalizing constants in the two families considered.
This construction corresponds to the situation where we simply have two completely
independent and unrelated statistical models, but we wish to consider them
together for some reason. We note that the log-likelihood function for the product
model is simply the sum of the log-likelihood functions for the factors

$$\ell_{x,y}(\theta, \xi) = \ell_x(\theta) + \ell_y(\xi). \qquad (3.7)$$


A different form of product occurs when the parameter spaces coincide, i.e. when
$\Xi = \Theta$ and the canonical statistics $s$ and $t$ map into the same space $\mathbb{R}^k$. We may then
form the direct product family

$$\mathcal{P} \otimes \mathcal{Q} = \{P_\theta \otimes Q_\theta \mid \theta \in \Theta\}$$

which now has representation space $(\mathcal{X} \times \mathcal{Y}, \mathbb{E} \otimes \mathbb{F})$, base measure $\mu \otimes \nu$, canonical
parameter space $\Theta$, and canonical statistic $u(x, y) = s(x) + t(y)$, since the
corresponding family of densities with respect to $\mu \otimes \nu$ may be written as

$$f_\theta(x, y) = \frac{1}{c(\theta)\, d(\theta)}\, e^{\theta^\top (s(x) + t(y))}$$

where $c(\theta)$ and $d(\theta)$ are the normalizing constants in the two families considered.
Again, if the two families are regular and minimally represented, so is the direct
product family and the dimension of the direct product is the same as for the
constituents and equal to $k$; the log-likelihood functions are again obtained by simple
addition

$$\ell_{x,y}(\theta) = \ell_x(\theta) + \ell_y(\theta). \qquad (3.8)$$

This construction corresponds to combining information from two independent
statistical experiments to obtain inference on the same common parameter $\theta$.
Section 8.3.1 provides examples of these product constructions and their use.


3.4.2 Repeated observations 


An important instance of the direct product above corresponds to independent
repetitions. More precisely, consider a sample $X_1, \dots, X_n$ from an exponential family $\mathcal{P}$
on the representation space $(\mathcal{X}, \mathbb{E})$ with canonical parameter space $\Theta$ and canonical
statistic $t$.
The model corresponding to repeated observations will be the $n$-fold direct
product and have representation space $(\mathcal{X}^n, \mathbb{E}^{\otimes n})$ and associated family

$$\mathcal{P}^{\otimes n} = \{P_\theta^{\otimes n} \mid \theta \in \Theta\}.$$


Theorem 3.11. Assume that $\mathcal{P}$ is a minimally represented and regular exponential
family as above. Then the $n$-fold direct product $\mathcal{P}^{\otimes n}$ is a regular exponential family
with the same parameter space and base measure $\mu^{\otimes n}$. The canonical statistic $t_n$ is
determined as

$$t_n(x_1, \dots, x_n) = t(x_1) + \cdots + t(x_n)$$

and the normalizing constant $c_n$ and cumulant function $\psi_n$ satisfy

$$c_n(\theta) = c(\theta)^n, \qquad \psi_n(\theta) = n\psi(\theta).$$

Proof. The density of $P_\theta^{\otimes n}$ with respect to $\mu^{\otimes n}$ is for $x = (x_1, \dots, x_n)$

$$p_n(x) = \prod_{i=1}^n \frac{1}{c(\theta)}\, e^{\theta^\top t(x_i)} = \frac{1}{c(\theta)^n}\, e^{\theta^\top \{t(x_1) + \cdots + t(x_n)\}} = \frac{1}{c_n(\theta)}\, e^{\theta^\top t_n(x)},$$

which yields the relations mentioned since

$$\psi_n(\theta) = \log c_n(\theta) = n \log c(\theta) = n\psi(\theta).$$

It remains to establish that the family is minimally represented. But assume for
contradiction that $\langle \lambda, t_n(X)\rangle$ is constant almost surely for some $\lambda \neq 0$. Then, by
independence,

$$0 = V_\theta\{\langle \lambda, t_n(X)\rangle\} = n V_\theta\{\langle \lambda, t(X_1)\rangle\}$$

which yields $\langle \lambda, t(X_1)\rangle$ constant, contradicting that the family $\mathcal{P}$ was minimally
represented.


3.4.3 Transformations 


In general, if an exponential model has been set up for a random variable X, and 
Y = s(X) is a function of X, the family of distributions of Y will typically not 
be an exponential family. An exception is when X is transformed by the canonical 
statistic, so that Y = t(X). This leads again to an exponential family with the same 
parameter space and the lifted base measure. We formulate that as a theorem. 

Theorem 3.12. Assume that $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ is a minimally represented and
regular exponential family with canonical parameter space $\Theta$, canonical statistic $t$,
and base measure $\mu$. Then the family $\tilde{\mathcal{P}}$ of distributions of $Y = t(X)$

$$\tilde{\mathcal{P}} = \{Q_\theta \mid \theta \in \Theta\}$$

where $Q_\theta = t(P_\theta)$, is a regular and minimally represented exponential family with
canonical parameter space $\Theta$, base measure $\nu = t(\mu)$, and canonical statistic
$\tilde{t} : \mathbb{R}^k \to \mathbb{R}^k$ given as $\tilde{t}(y) = y$. The cumulant function is unchanged.

Proof. This follows directly from using the transformation formula ensuring that

$$Q_\theta(A) = P_\theta\{t^{-1}(A)\} = \frac{1}{c(\theta)} \int_{t^{-1}(A)} e^{\theta^\top t(x)}\, \mu(dx) = \frac{1}{c(\theta)} \int_A e^{\theta^\top y}\, (t(\mu))(dy).$$

Preservation of the other properties is obvious.


This fact is typically exploited in connection with repeated observations, where
the observations $X_1, \dots, X_n$ are replaced by the sum $Y = t(X_1) + \cdots + t(X_n)$.
An exponential family with the identity $t(y) = y$ as canonical statistic is known as a
natural exponential family (NEF).


Example 3.13. [Binomial model] Consider the Bernoulli model with $X_1, \dots, X_n$
independent and identically distributed as

$$P_\mu(X_i = x) = \mu^x (1 - \mu)^{1 - x}, \qquad x \in \{0, 1\},$$

represented as an exponential family using the parameter $\theta = \log\{\mu/(1 - \mu)\}$ with
densities

$$f_\theta(x_1, \dots, x_n) = \frac{e^{\theta(x_1 + \cdots + x_n)}}{(1 + e^\theta)^n}$$

with respect to counting measure on $\{0, 1\}^n$. We may instead consider the distribution
of the sum $Y = X_1 + \cdots + X_n$ directly, yielding the binomial family

$$P_\mu(Y = y) = \binom{n}{y} \mu^y (1 - \mu)^{n - y}$$

having density

$$f_\theta(y) = \frac{e^{\theta y}}{(1 + e^\theta)^n}$$

with respect to $\nu = \binom{n}{y} \cdot m$ where $m$ is counting measure on $\mathcal{Y} = \{0, \dots, n\}$. Thus,
the binomial family is an example of a NEF.

If we use the canonical statistic to transform the observations, the log-likelihood
function of the transformed family is identical to the original log-likelihood function
since when $y = t(x)$

$$\tilde{\ell}_y(\theta) = \theta^\top y - \psi(\theta) = \theta^\top t(x) - \psi(\theta) = \ell_x(\theta).$$

So any inference based on the log-likelihood function is unaffected by the
transformation $y = t(x)$. This property is known as sufficiency of the transformation $t$;
no information about $\theta$ in $x$ is lost by using $y = t(x)$ instead of $x$. The notion of
sufficiency is yet another fundamental concept introduced by R. A. Fisher, but the
concept will not be discussed in detail in this book.
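
As a small check of the binomial example, the sketch below multiplies the density $e^{\theta y}/(1 + e^\theta)^n$ by the base measure weights $\binom{n}{y}$ and compares the result with the usual binomial point probabilities. The values of $n$ and $\mu$ and the function names are hypothetical choices made for illustration.

```python
import math

def nef_density(y, n, theta):
    """Density exp(theta*y) / (1 + exp(theta))^n w.r.t. nu = binom(n, y) * counting measure."""
    return math.exp(theta * y) / (1 + math.exp(theta)) ** n

def binomial_pmf(y, n, mu):
    """Binomial point probability in the original parameter mu."""
    return math.comb(n, y) * mu**y * (1 - mu) ** (n - y)

n, mu = 7, 0.3                      # hypothetical number of trials and success probability
theta = math.log(mu / (1 - mu))     # canonical parameter
pmf_from_nef = [math.comb(n, y) * nef_density(y, n, theta) for y in range(n + 1)]
print(pmf_from_nef)
print([binomial_pmf(y, n, mu) for y in range(n + 1)])
```

Both lists coincide up to rounding and sum to one, which is the content of the identity $e^{\theta y}/(1 + e^\theta)^n = \mu^y(1 - \mu)^{n - y}$ when $\theta = \log\{\mu/(1 - \mu)\}$.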


3.4.4 Affine subfamilies 


We next consider exponential families given by assuming that the canonical
parameter is contained in a subset $\Theta_0 \subset \Theta$ of the form

$$\Theta_0 = \Theta \cap (L + a),$$

where $L \subset V$ is a linear subspace of $V$ of dimension $\dim L = d$ and $a \in V$ or, in
other words, the parameter is restricted to an affine subspace $L + a$ of the original
parameter space. To ensure this is interesting, we shall assume that $\Theta_0$ is not empty.
We now have


Theorem 3.14. Assume that $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ is a minimally represented and
regular exponential family with canonical parameter space $\Theta$ and $\Theta_0$ is a non-empty
affine subset of $\Theta$ as above. Then the subfamily

$$\tilde{\mathcal{P}} = \{P_{\tilde\theta + a} \mid \tilde\theta \in \tilde\Theta\}$$

is an exponential family with canonical parameter space $\tilde\Theta = \Theta_0 - a \subset L$. The
canonical statistic $\tilde{t} : \mathcal{X} \to L$ may be represented as $\tilde{t}(x) = \Pi_L(t(x))$, where $\Pi_L$ is
the projection onto $L$. The base measure and cumulant function are

$$\tilde\mu(dx) = e^{\langle a, t(x)\rangle}\, \mu(dx), \qquad \tilde\psi(\tilde\theta) = \psi(\tilde\theta + a)$$

and the family $\tilde{\mathcal{P}}$ is a regular and minimally represented exponential family of
dimension $d = \dim L$.


Proof. We consider the exponent in the representation of the larger family and have
for $\theta \in \Theta_0$ and thus $\tilde\theta = \theta - a \in L$:

$$\langle \theta, t(x)\rangle = \langle \theta - a, t(x)\rangle + \langle a, t(x)\rangle = \langle \theta - a, \Pi_L(t(x))\rangle + \langle a, t(x)\rangle = \langle \tilde\theta, \tilde{t}(x)\rangle + \langle a, t(x)\rangle,$$

where $\tilde{t}(x) = \Pi_L(t(x))$. So if we let $\tilde\mu(dx) = e^{\langle a, t(x)\rangle}\, \mu(dx)$ we may write the
density $g_{\tilde\theta}$ of $Q_{\tilde\theta} = P_{\tilde\theta + a}$ with respect to $\tilde\mu$ as

$$g_{\tilde\theta}(x) = e^{\langle \tilde\theta, \tilde{t}(x)\rangle - \psi(\tilde\theta + a)}.$$

Now, since $\Theta$ was an open subset of $V$, $\tilde\Theta$ is an open subset of $L$ so the family is
regular. It is also minimally represented since for any $\lambda \in L$ we have

$$\langle \lambda, \tilde{t}(X)\rangle = \langle \lambda, \Pi_L(t(X))\rangle = \langle \lambda, t(X)\rangle$$

and thus if $\langle \lambda, \tilde{t}(X)\rangle$ is a.s. constant with respect to $\tilde\mu$, then also $\langle \lambda, t(X)\rangle$ is a.s.
constant with respect to $\mu$, contradicting that $\mathcal{P}$ was minimally represented.


Note that we could have represented the family with the canonical statistic $t(x)$
instead of $\tilde{t}(x) = \Pi_L(t(x))$ since

$$\langle \tilde\theta, t(x)\rangle = \langle \tilde\theta, \Pi_L(t(x))\rangle = \langle \tilde\theta, \tilde{t}(x)\rangle$$

but this would then not be a minimal representation.

Example 3.15. Consider two independent Poisson random variables $X$ and $Y$ with
expectations $\lambda$ and $\mu$, respectively, and corresponding canonical parameters
$\theta = \log\lambda$ and $\xi = \log\mu$, as in Example 3.4. We first consider the outer product family
which is the family of densities

$$f_{\lambda,\mu}(x, y) = \frac{\lambda^x}{x!} e^{-\lambda}\, \frac{\mu^y}{y!} e^{-\mu} = e^{\theta x + \xi y - (e^\theta + e^\xi)}\, \frac{1}{x!\, y!}$$

with canonical parameter $(\theta, \xi)^\top$, canonical statistic $t(x, y) = (x, y)^\top$, cumulant
function $\psi(\theta, \xi) = e^\theta + e^\xi$, and base measure equal to $\frac{1}{x!\, y!} \cdot m$ where $m$ is counting
measure on $\mathbb{N}_0^2$.

Next consider the submodel determined by $\mu = \lambda^2$. This is an affine (in fact
linear) submodel of the outer product above, where

$$L = \{(\theta, \xi) \mid \xi = 2\theta\}.$$

The space $L$ is spanned by the vector $v = (1, 2)^\top$, so the projection onto $L$ of
$t = (t_1, t_2)^\top$ is given as

$$\Pi_L t = \frac{\langle t, v\rangle}{\|v\|^2}\, v = \frac{t_1 + 2t_2}{5} \begin{pmatrix} 1 \\ 2 \end{pmatrix}$$

corresponding to the relation

$$\theta x + 2\theta y = \theta(x + 2y) = \langle \theta v, \Pi_L(t(x, y))\rangle.$$

Alternatively we may simply represent the family of densities as

$$f_\theta(x, y) = f_{\lambda,\lambda^2}(x, y) = \frac{\lambda^{x + 2y}}{x!\, y!}\, e^{-\lambda - \lambda^2} = e^{\theta(x + 2y) - (e^\theta + e^{2\theta})}\, \frac{1}{x!\, y!}$$

showing that the new cumulant function becomes $\tilde\psi(\theta) = \psi(\theta, 2\theta) = e^\theta + e^{2\theta}$. The
corresponding model is an example of a log-linear Poisson model.
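
The projection used for the two related Poisson variables is easy to verify by hand or by machine. The sketch below projects $t = (x, y)^\top$ onto the line spanned by $v = (1, 2)^\top$ and checks that pairing the projection with $\theta v$ gives $\theta(x + 2y)$. The numerical values and the helper names are hypothetical.

```python
def inner(a, b):
    """Standard inner product on R^2 (or any R^n)."""
    return sum(ai * bi for ai, bi in zip(a, b))

def project_onto_line(t, v):
    """Orthogonal projection of t onto span{v}: (<t, v> / <v, v>) v."""
    scale = inner(t, v) / inner(v, v)
    return [scale * vi for vi in v]

v = [1.0, 2.0]        # spans L = {(theta, xi) : xi = 2 theta}
theta = 0.4           # hypothetical canonical parameter
x, y = 3, 5           # hypothetical observations
proj = project_onto_line([x, y], v)
lhs = theta * (x + 2 * y)
rhs = inner([theta * vi for vi in v], proj)
print(proj, lhs, rhs)
```

Here the projection is $(13/5)\,(1, 2)^\top$, and both sides of the relation evaluate to $0.4 \cdot 13 = 5.2$ up to rounding.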


In this example, the affine submodel was determined as the pre-image of a linear
map, here as

$$\Theta_0 = h^{-1}(0), \qquad h(\theta, \xi) = \xi - 2\theta.$$

We shall also be interested in the case where $\Theta_0$ is given as the image of an affine
map, i.e. where $\Theta_0$ has the form

$$\Theta_0 = \{a + A\beta \mid \beta \in B\} \qquad (3.9)$$

where $B \subseteq \mathbb{R}^d$ is open and convex. We have the following variant of Theorem 3.14:


Theorem 3.16. Assume that $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ is a minimally represented and
regular exponential family with canonical parameter space $\Theta$ and $\Theta_0$ is a non-empty
affine subset of $\Theta$ given as (3.9) where $A$ is injective and $B \subseteq \mathbb{R}^d$ is open and convex.
Then the subfamily

$$\tilde{\mathcal{P}} = \{P_{A\beta + a} \mid \beta \in B\}$$

is a regular exponential family with canonical parameter space $B$. The canonical
statistic $\tilde{t} : \mathcal{X} \to \mathbb{R}^d$ may be represented as $\tilde{t}(x) = A^* t(x)$, where $A^*$ is the adjoint
of $A$. The base measure and cumulant function are

$$\tilde\mu(dx) = e^{\langle a, t(x)\rangle}\, \mu(dx), \qquad \tilde\psi(\beta) = \psi(A\beta + a)$$

and the family $\tilde{\mathcal{P}}$ is a regular and minimally represented exponential family of
dimension $d$.


Proof. We consider the exponent in the representation of the larger family as

$$\langle A\beta + a, t(x)\rangle = \langle A\beta, t(x)\rangle + \langle a, t(x)\rangle = \langle \beta, A^* t(x)\rangle + \langle a, t(x)\rangle = \langle \beta, \tilde{t}(x)\rangle + \langle a, t(x)\rangle.$$

So if we let $\tilde\mu(dx) = e^{\langle a, t(x)\rangle}\, \mu(dx)$, we may write the density $g_\beta$ of $P_{A\beta + a}$ with
respect to $\tilde\mu$ as

$$g_\beta(x) = e^{\langle \beta, \tilde{t}(x)\rangle - \psi(A\beta + a)}.$$

Now, since $B$ was an open subset of $\mathbb{R}^d$, the family is regular. It is also minimally
represented since

$$\langle \lambda, \tilde{t}(X)\rangle = \langle A\lambda, t(X)\rangle$$

and thus if $\langle \lambda, \tilde{t}(X)\rangle$ is a.s. constant with respect to $\tilde\mu$, then also $\langle A\lambda, t(X)\rangle$ is a.s.
constant with respect to $\mu$, implying that $A\lambda = 0$. But since $A$ was assumed injective,
this implies $\lambda = 0$.


3.5 Moments, score, and information 


One important feature of an exponential family is that the mean and variance of the
canonical statistic may be calculated by differentiation of the cumulant function, and
the score function and information have a particularly simple form, as we shall show
in this section.


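Theorem 3.17. In a minimally represented and regular exponential family with
canonical parameter space $\Theta$ and canonical statistic $t$, the cumulant function $\psi(\theta)$
is smooth and it holds that

$$E_\theta\{t(X)\} = \nabla\psi(\theta) = \tau(\theta), \qquad V_\theta\{t(X)\} = D^2\psi(\theta) = D\tau(\theta) = \kappa(\theta).$$

Further, the map $\theta \mapsto \tau(\theta)$ is smooth and injective, the map $\theta \mapsto \kappa(\theta)$ is smooth,
and $\kappa(\theta)$ is positive definite for all $\theta \in \Theta$.


Proof. Theorem 3.9 yields that $c(\theta)$ is smooth and hence $\psi(\theta) = \log c(\theta)$ is smooth.
Letting $\tau(\theta) = E_\theta\{t(X)\}$, we get from Theorem 3.9 that

$$\frac{\partial\psi(\theta)}{\partial\theta_i} = \frac{1}{c(\theta)} \frac{\partial c(\theta)}{\partial\theta_i} = E_\theta\{t_i(X)\} = \tau_i(\theta)$$

whence $\tau$ is smooth as a derivative of a smooth function.
Letting $\kappa(\theta) = D^2\psi(\theta)$ yields that $\kappa$ is smooth; differentiating once more and
using Theorem 3.9 yields

$$\frac{\partial^2\psi(\theta)}{\partial\theta_i \partial\theta_j} = \frac{1}{c(\theta)} \frac{\partial^2 c(\theta)}{\partial\theta_i \partial\theta_j} - \frac{1}{c(\theta)^2} \frac{\partial c(\theta)}{\partial\theta_i} \frac{\partial c(\theta)}{\partial\theta_j}$$

$$= E_\theta\{t_i(X) t_j(X)\} - E_\theta\{t_i(X)\} E_\theta\{t_j(X)\} = V_\theta\{t_i(X), t_j(X)\}.$$

We first argue that $\kappa(\theta)$ is positive definite. Since $\Sigma_\theta = V_\theta\{t(X)\}$ is a covariance
matrix, it is positive semidefinite. Suppose there is a vector $\lambda \in \mathbb{R}^k$ with $\lambda \neq 0$ such that
$\lambda^\top \Sigma_\theta \lambda = 0$. Then we must have

$$V_\theta\{\lambda^\top t(X)\} = \lambda^\top \Sigma_\theta \lambda = 0$$

implying that $\lambda^\top t(X)$ is almost surely constant. But this contradicts that the family
was assumed minimal. Hence $\Sigma_\theta$ must be positive definite.
Next we argue that $\tau$ is injective. Consider $\theta_1, \theta_2 \in \Theta$ with $\theta_1 \neq \theta_2$ and let for
$\alpha \in [0, 1]$

$$\theta_\alpha = \alpha\theta_1 + (1 - \alpha)\theta_2.$$

Then $\theta_\alpha \in \Theta$ since $\Theta$ is convex. Next, define $g : [0, 1] \to \mathbb{R}$ as

$$g(\alpha) = (\theta_1 - \theta_2)^\top \tau(\theta_\alpha).$$

Composite differentiation with respect to $\alpha \in (0, 1)$ yields

$$g'(\alpha) = (\theta_1 - \theta_2)^\top D\tau(\theta_\alpha)(\theta_1 - \theta_2) = (\theta_1 - \theta_2)^\top \kappa(\theta_\alpha)(\theta_1 - \theta_2) > 0$$

since $\kappa(\theta_\alpha)$ is positive definite. Thus $g(\alpha)$ is strictly increasing on $(0, 1)$ and
continuous on $[0, 1]$ so we must have $g(1) > g(0)$, i.e.

$$g(1) = (\theta_1 - \theta_2)^\top \tau(\theta_1) > (\theta_1 - \theta_2)^\top \tau(\theta_2) = g(0)$$

which can be rearranged to give

$$(\theta_1 - \theta_2)^\top (\tau(\theta_1) - \tau(\theta_2)) > 0$$

whence we conclude that $\tau(\theta_1) \neq \tau(\theta_2)$ so $\tau$ is injective.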


We shall illustrate this in a few examples. 


Example 3.18. [Poisson moments] Consider the simple Poisson model, represented
as a regular exponential family in Example 3.4 with canonical parameter $\theta = \log\lambda$
as

$$f_\theta(x) = \frac{1}{x!}\, e^{\theta x - e^\theta}$$

with cumulant function $\psi(\theta) = e^\theta$. We get by direct differentiation that

$$E_\theta(X) = \psi'(\theta) = e^\theta = \lambda, \qquad V_\theta(X) = \psi''(\theta) = e^\theta = \lambda$$

identifying $\lambda = e^\theta$ as the mean of the Poisson distribution.

Example 3.19. [Bernoulli moments] Consider the Bernoulli model as an exponential
family in Example 3.3 with canonical parameter $\theta = \log(\mu/(1 - \mu))$, i.e. with density

$$f_\theta(x) = \mu^x (1 - \mu)^{1 - x} = \frac{e^{\theta x}}{1 + e^\theta}, \qquad x \in \{0, 1\}.$$

Here the cumulant function is $\psi(\theta) = \log(1 + e^\theta)$ so we get for the moments that

$$E_\theta(X) = \psi'(\theta) = \frac{e^\theta}{1 + e^\theta} = \mu, \qquad V_\theta(X) = \psi''(\theta) = \frac{e^\theta}{(1 + e^\theta)^2} = \mu(1 - \mu)$$

as we know well.


Example 3.20. As another small example, let us consider the model in Example 3.15
with two related Poisson distributions. Here we derived the cumulant function to be
$\tilde\psi(\theta) = e^\theta + e^{2\theta}$ and we thus get by differentiation that

$$E_\theta(X + 2Y) = \tilde\psi'(\theta) = e^\theta + 2e^{2\theta} = \lambda + 2\lambda^2,$$

$$V_\theta(X + 2Y) = \tilde\psi''(\theta) = e^\theta + 4e^{2\theta} = \lambda + 4\lambda^2$$

which we of course could also derive directly from the fact that $X$ and $Y$ are
independent Poisson variables with means $\lambda$ and $\lambda^2$.
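
The moment relations for $X + 2Y$ can be checked without any symbolic work. The sketch below differentiates the cumulant function $e^\theta + e^{2\theta}$ by central finite differences and compares with $\lambda + 2\lambda^2$ and $\lambda + 4\lambda^2$, the mean and variance obtained directly from independent Poisson variables with means $\lambda$ and $\lambda^2$. The value of $\lambda$, the step sizes, and the function names are assumptions for illustration.

```python
import math

def psi_tilde(theta):
    """Cumulant function of the submodel: exp(theta) + exp(2*theta)."""
    return math.exp(theta) + math.exp(2 * theta)

def first_derivative(f, x, h=1e-4):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-3):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

lam = 1.7                          # hypothetical mean of X
theta = math.log(lam)
mean_direct = lam + 2 * lam**2     # E(X) + 2 E(Y)
var_direct = lam + 4 * lam**2      # V(X) + 4 V(Y), by independence
print(first_derivative(psi_tilde, theta), mean_direct)
print(second_derivative(psi_tilde, theta), var_direct)
```

The numerical derivatives match the directly computed moments to within the finite-difference error.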


For a more complicated example, we consider the linear normal model: 


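Example 3.21. [Moments in linear normal model] In the linear normal model we
may also derive the moments from the cumulant function (3.4). We get

$$E\{\Pi_L(X)\}_i = \frac{\partial\psi(\theta)}{\partial\theta_{1i}} = \frac{\theta_{1i}}{\theta_2} = \frac{\xi_i/\sigma^2}{1/\sigma^2} = \xi_i,$$

where we have identified $\theta_1$ and $\xi$ with their coordinates in an orthonormal basis for
$L$, and further

$$E\{-\|X\|^2/2\} = \frac{\partial\psi(\theta)}{\partial\theta_2} = -\frac{d}{2\theta_2} - \frac{\|\theta_1\|^2}{2\theta_2^2} = -\frac{d\sigma^2}{2} - \frac{\|\xi\|^2}{2}$$

or, equivalently

$$E\{\|X\|^2\} = d\sigma^2 + \|\xi\|^2.$$

Differentiating a second time yields

$$V\{\Pi_L(X)\}_{ij} = \frac{\partial^2\psi(\theta)}{\partial\theta_{1i} \partial\theta_{1j}} = \frac{\delta_{ij}}{\theta_2} = \sigma^2 \delta_{ij}$$

where $\delta_{ij}$ is Kronecker's delta:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise,} \end{cases}$$

and

$$V\{\Pi_L(X)_i, -\|X\|^2/2\} = \frac{\partial^2\psi(\theta)}{\partial\theta_{1i} \partial\theta_2} = -\frac{\theta_{1i}}{\theta_2^2} = -\frac{\xi_i/\sigma^2}{1/\sigma^4} = -\sigma^2 \xi_i$$

as well as

$$V\{-\|X\|^2/2\} = \frac{\partial^2\psi(\theta)}{\partial\theta_2^2} = \frac{d}{2\theta_2^2} + \frac{\|\theta_1\|^2}{\theta_2^3} = d\sigma^4/2 + \sigma^2\|\xi\|^2$$

so that

$$V\{\|X\|^2\} = 2\sigma^4 d + 4\sigma^2\|\xi\|^2$$

is the variance of $\|X\|^2$.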


As a corollary, we obtain the following relations for score and information in 
exponential families. 


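Corollary 3.22. In a minimally represented and regular exponential family with
canonical parameter space $\Theta$, the score, information function, Fisher information, and
quadratic score satisfy

$$S(x, \theta) = (t(x) - \tau(\theta))^\top, \qquad I(x, \theta) = i(\theta) = \kappa(\theta),$$

$$Q(x, \theta) = (t(x) - \tau(\theta))^\top \kappa(\theta)^{-1} (t(x) - \tau(\theta)).$$


Proof. This is obtained by direct differentiation of the log-likelihood function
$\ell_x(\theta) = \theta^\top t(x) - \psi(\theta)$, using Theorem 3.17.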


Remark 3.23. It is important to realize that the relations for the score and
information in Corollary 3.22 are specific to the canonical parametrization with $\Theta$ as
parameter space. For an arbitrary smooth reparametrization, Theorem 1.30 or
Corollary 1.33 must be used. The quadratic score is invariant under reparametrizations,
but the others are not.


For example, since the map $\tau$ is smooth and injective, we can define a new
parametrization by letting $\eta = \tau(\theta)$ with parameter space $\mathcal{M} = \tau(\Theta)$. This
parametrization is known as the mean value parametrization of the exponential family. Thus in
this mean value parametrization we have from Theorem 1.30 that

$$S(x, \eta)^\top = \kappa(\theta)^{-1}(t(x) - \tau(\theta)) = \kappa(\theta)^{-1}(t(x) - \eta),$$

$$i(\eta) = V_\eta\{S(X, \eta)^\top\} = \kappa(\theta)^{-1} \kappa(\theta) \kappa(\theta)^{-1} = \kappa(\theta)^{-1}.$$

In other words, the information about the mean value parameter $\eta$ is the inverse of
the information about the canonical parameter $\theta$.
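
For the Bernoulli model this inverse relation can be confirmed by a direct calculation. In the sketch below the information about the canonical parameter is $\kappa(\theta) = e^\theta/(1 + e^\theta)^2$, while the information about the mean value parameter $\mu$ is computed from scratch as the expected squared score of $x \log\mu + (1 - x)\log(1 - \mu)$. The value of $\mu$ and the function names are hypothetical.

```python
import math

def kappa(theta):
    """Variance of t(X) = X in the Bernoulli family: exp(theta) / (1 + exp(theta))^2."""
    e = math.exp(theta)
    return e / (1 + e) ** 2

def information_mean_value_direct(mu):
    """Expected squared score with respect to mu, summed over x in {0, 1}."""
    score_at_one = 1 / mu           # derivative of log(mu)
    score_at_zero = -1 / (1 - mu)   # derivative of log(1 - mu)
    return mu * score_at_one**2 + (1 - mu) * score_at_zero**2

mu = 0.3                            # hypothetical mean value parameter
theta = math.log(mu / (1 - mu))     # corresponding canonical parameter
print(kappa(theta), information_mean_value_direct(mu), 1 / kappa(theta))
```

The second and third printed values agree, both being $1/\{\mu(1 - \mu)\}$, while the first is $\mu(1 - \mu)$.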


3.6 Curved exponential families 


Sometimes it is too restrictive to assume that the parameter set $\Theta$ is open and convex
to cover a range of cases of interest to us. Instead we consider the following.

Definition 3.24. A curved exponential family of dimension $m$ and order $k$ is a family
of the form

$$\mathcal{P} = \{P_{\phi(\beta)}, \beta \in B\} \subseteq \mathcal{Q} = \{P_\theta, \theta \in \Theta\}$$

where

a) $\mathcal{Q}$ is a $k$-dimensional regular and minimally represented exponential family with
canonical parameter space $\Theta$.

b) $B \subseteq \mathbb{R}^m$ is open and $m \le k$.

c) $\phi : B \to \Theta$ is smooth.

d) The Jacobian matrix

$$J(\beta) = D\phi(\beta) = \left\{\frac{\partial\phi_i}{\partial\beta_j}(\beta)\right\}$$

has full rank $m$ for all $\beta \in B$.

e) $\phi$ is a homeomorphism onto its image.

The larger exponential family $\mathcal{Q}$ shall be termed the ambient exponential family.


We recall that $\phi$ is a homeomorphism onto its image if and only if $\phi$ is injective,
continuous, and has a continuous inverse, i.e. if it holds that

$$\lim_{n \to \infty} \phi(\beta_n) = \phi(\beta) \iff \lim_{n \to \infty} \beta_n = \beta.$$

Figure 3.1 yields an example where $\phi$ is not a homeomorphism, and one that is. It
is worth mentioning that all results following remain true if $\phi$ is just twice continuously
differentiable rather than smooth. The proofs then just need to be adapted by
using versions of the implicit and inverse function theorems that only assume twice
continuous differentiability and have similar weaker conclusions.


Remark 3.25. Note that if the ambient exponential family is reparametrized with
new parameter space $\Lambda$ and $\lambda = \rho(\theta)$ with $\rho$ smooth and injective having a regular
Jacobian, then $\phi : B \to \Theta$ satisfies the regularity conditions if and only if $\tilde\phi = \rho \circ \phi$
does. This means that we can verify whether a family is a curved exponential family
in any valid parametrization.


Figure 3.1: The curve to the right does not satisfy the requirement of a homeomorphism,
whereas the curve to the left does.

We illustrate the use of this remark in a simple example.


Example 3.26. Consider the family of distributions on $\mathbb{R}_+^2$ determined by $X$ and $Y$
being independent and exponentially distributed with expectations $E(X) = \lambda_1$ and
$E(Y) = \lambda_2$, where $\lambda \in \Lambda = \mathbb{R}_+^2$, i.e. the family of distributions with densities

$$f_\lambda(x, y) = \frac{1}{\lambda_1 \lambda_2}\, e^{-x/\lambda_1 - y/\lambda_2}$$

with respect to Lebesgue measure on $\mathbb{R}_+^2$. As an outer product (see Section 3.4.1)
of regular exponential families, this is a regular exponential family with canonical
parameter $\theta = (\theta_1, \theta_2)^\top = (1/\lambda_1, 1/\lambda_2)^\top \in \Theta = \mathbb{R}_+^2$.

Now consider the subfamily determined by the relation $\lambda_2 = \lambda_1^2$. This is given as
the image of the map $\phi : \mathbb{R}_+ \to \Lambda$ determined as $\lambda = \phi(\beta) = (\beta, \beta^2)^\top$. The map $\phi$
is clearly an injective homeomorphism and it has Jacobian

$$J(\beta) = D\phi(\beta) = \begin{pmatrix} 1 \\ 2\beta \end{pmatrix}$$

which has full rank for all $\beta$. The point here is that we may verify the regularity
conditions without having to reparametrize using the canonical parameter $\theta$ and the
map $\theta = \tilde\phi(\beta) = (1/\beta, 1/\beta^2)^\top$ into the canonical parameter space $\Theta$.


Clearly, a minimal and regular exponential family is itself a curved exponential
family with $m = k$, $B = \Theta$, and $\phi(\beta) = \beta$.

Example 3.27. [Affine subfamilies] Another simple example of a curved exponential
family is an affine subfamily as discussed in Section 3.4.4. There we considered a
submodel determined by intersecting the canonical parameter space with an affine
subspace, i.e. we considered

$$\Theta_0 = \Theta \cap (L + a)$$

where $L$ is a subspace of $\mathbb{R}^k$ of dimension $m$ and $a \in \mathbb{R}^k$. We may parametrize $L$
by a linear map $A : \mathbb{R}^m \to \mathbb{R}^k$ and this yields a curved exponential model with
$B = \mathbb{R}^m$ and

$$\phi(\beta) = A\beta + a.$$

Then $\phi$ satisfies the conditions above if and only if $A = D\phi$ has full rank $m$, i.e. if
$A$ is injective.

Another instance of a curved family appears if the affine restriction is used on a
parametrization which is not canonical. For example, if we consider the mean value
parameter $\eta = \tau(\theta) \in \mathcal{M} = \tau(\Theta)$ and now restrict $\eta$ to an affine subspace of $\mathcal{M}$

$$\mathcal{M}_0 = \mathcal{M} \cap (L + a)$$

as above. In most cases, this will not itself be a regular exponential family, but it will
satisfy the conditions for a curved exponential family.

Example 3.28. [Bivariate normal with mean on semi-circle] A simple example of a
proper curved subfamily of an exponential family is obtained by assuming

$$X \sim N_2(\theta, I_2).$$

If $\Theta = \mathbb{R}^2$, this is a regular exponential family of dimension 2 with $t(X) = X$ as
canonical statistic, since we can write

$$f(x; \theta) = \frac{1}{2\pi}\, e^{-((x_1 - \theta_1)^2 + (x_2 - \theta_2)^2)/2}$$

$$= \frac{1}{2\pi}\, e^{\theta_1 x_1 + \theta_2 x_2 - x_1^2/2 - x_2^2/2 - \theta_1^2/2 - \theta_2^2/2}$$

$$= \frac{e^{\theta_1 x_1 + \theta_2 x_2}}{e^{(\theta_1^2 + \theta_2^2)/2}}\, \frac{e^{-(x_1^2 + x_2^2)/2}}{2\pi}.$$

We have then represented this family as a regular exponential family with base
measure $N_2(0, I_2)$ and cumulant function $\psi(\theta) = (\theta_1^2 + \theta_2^2)/2$. The log-likelihood
function, ignoring additive constants, is

$$\ell(\theta) = \theta^\top x - \frac{1}{2}\|\theta\|^2.$$

Assume now further that the mean $\theta$ lies on a semi-circle in the right half-plane
$\{\theta : \theta_1 > 0\}$:

$$\theta = \phi(\beta) = \begin{pmatrix} \cos\beta \\ \sin\beta \end{pmatrix},$$

where $\beta \in B = (-\pi/2, \pi/2)$. Here $\phi$ is a smooth and injective homeomorphism
with Jacobian

$$J(\beta) = \begin{pmatrix} -\sin\beta \\ \cos\beta \end{pmatrix}$$

which is never zero, hence has full rank. This is a curved exponential family of
dimension 1 and order 2. The parameter space is displayed in Figure 3.2, and the
log-likelihood function now reduces to

$$\ell(\beta) = \phi(\beta)^\top x \qquad (3.10)$$

since $\|\phi(\beta)\|^2 = 1$ for all $\beta$ so the cumulant function becomes an additive constant
and may be ignored.
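
To make the reduced log-likelihood $\ell(\beta) = x_1\cos\beta + x_2\sin\beta$ of the semi-circle model concrete, the sketch below maximizes it over a grid in $(-\pi/2, \pi/2)$ and compares with the angle $\arctan(x_2/x_1)$ of the observation. That this angle is the maximizer when $x_1 > 0$ is an elementary calculation of ours, not a statement taken from the text; the observation, grid size, and function names are hypothetical.

```python
import math

def loglik(beta, x):
    """Reduced log-likelihood phi(beta)^T x with phi(beta) = (cos beta, sin beta)."""
    return math.cos(beta) * x[0] + math.sin(beta) * x[1]

def grid_maximizer(x, points=20_001):
    """Maximize the log-likelihood over an interior grid of (-pi/2, pi/2)."""
    best_beta, best_value = None, -math.inf
    for i in range(1, points):
        beta = -math.pi / 2 + math.pi * i / points
        value = loglik(beta, x)
        if value > best_value:
            best_beta, best_value = beta, value
    return best_beta

x = (1.2, -0.7)                       # hypothetical observation with x_1 > 0
closed_form = math.atan2(x[1], x[0])  # angle of x, assumed maximizer for x_1 > 0
print(grid_maximizer(x), closed_form, loglik(closed_form, x), math.hypot(*x))
```

At the maximizer the log-likelihood equals $\|x\|$. For an observation with $x_1 \le 0$ the supremum is approached at the boundary of $B$, so the grid search would then only return a point next to an endpoint.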


Figure 3.2: The curved subfamily in Example 3.28 determined by the mean being on
a semi-circle.

Another example is obtained by assuming the coefficient of variation to be fixed
and known:

Example 3.29. [Fixed coefficient of variation] Consider the model determined by
$X \sim N(\beta, \beta^2)$ where $\beta \in B = (0, \infty)$ is unknown. In other words, the normal
distribution is considered to have a fixed and known coefficient of variation, here equal
to one. This is a curved subfamily of the normal exponential family

$$\mathcal{P} = \{N(\xi, \sigma^2), \xi \in \mathbb{R}, \sigma^2 > 0\}$$

with canonical parameters

$$\theta_1 = \frac{\xi}{\sigma^2}, \qquad \theta_2 = \frac{1}{\sigma^2},$$

and hence the family considered is given by the map

$$\phi(\beta) = \begin{pmatrix} \beta\beta^{-2} \\ \beta^{-2} \end{pmatrix} = \begin{pmatrix} 1/\beta \\ 1/\beta^2 \end{pmatrix}.$$

The map $\phi$ is clearly a smooth and injective homeomorphism with Jacobian

$$J(\beta) = \begin{pmatrix} -1/\beta^2 \\ -2/\beta^3 \end{pmatrix}$$

having full rank $m = 1$ for all $\beta$. The curved parameter space is displayed in
Fig. 3.3.

Figure 3.3: The curved parameter space $\phi(\beta) = (1/\beta, 1/\beta^2)^\top$ of the model in
Example 3.29 determined by normal distributions with a fixed coefficient of variation.


We now have:


Theorem 3.30. A curved exponential family is smooth and stable.


Proof. The proof is essentially identical to that of Lemma 1.29 where a bijective
reparametrization was considered. Theorem 3.10 ensures the larger family $\mathcal{Q}$ to have
smooth and suitably bounded derivatives. Smoothness as a function of $\beta$ now follows
by composition of functions. We let $L_x(\theta)$ and $\tilde{L}_x(\beta)$ denote the likelihood functions
in the regular and curved family so that

$$\tilde{L}_x(\beta) = L_x(\phi(\beta)).$$

Using composite differentiation (Rudin, 1976, Theorem 9.15) yields

$$\frac{\partial\tilde{L}_x(\beta)}{\partial\beta_i} = \sum_u \frac{\partial L_x(\theta)}{\partial\theta_u} \frac{\partial\phi_u(\beta)}{\partial\beta_i}, \qquad \theta = \phi(\beta),$$

or, in matrix form

$$D\tilde{L}_x(\beta) = DL_x(\phi(\beta))\, D\phi(\beta),$$

where $D\phi(\beta)$ is the Jacobian of $\phi$. So the left-hand side is locally bounded by
integrable functions if $DL_x(\phi(\beta))$ is.

Differentiating a second time yields

$$\frac{\partial^2\tilde{L}_x(\beta)}{\partial\beta_i \partial\beta_j} = \sum_{u,v} \frac{\partial^2 L_x(\theta)}{\partial\theta_u \partial\theta_v} \frac{\partial\phi_u(\beta)}{\partial\beta_i} \frac{\partial\phi_v(\beta)}{\partial\beta_j} + \sum_u \frac{\partial L_x(\theta)}{\partial\theta_u} \frac{\partial^2\phi_u(\beta)}{\partial\beta_i \partial\beta_j}$$

which again is locally bounded by integrable functions if the derivatives on the
right-hand side are.


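Proposition 3.31. The score, Fisher information, and quadratic score in a curved
exponential family are given as

$$S(x, \beta) = (t(x) - \tau(\phi(\beta)))^\top J(\beta), \qquad i(\beta) = J(\beta)^\top \kappa(\phi(\beta)) J(\beta),$$

$$Q(x, \beta) = (t(x) - \tau(\phi(\beta)))^\top J(\beta)\, i(\beta)^{-1} J(\beta)^\top (t(x) - \tau(\phi(\beta))).$$


Proof. This is obtained by composite differentiation of the log-likelihood function.
We get

$$S(x, \beta) = \frac{\partial}{\partial\beta}\left\{\phi(\beta)^\top t(x) - \psi(\phi(\beta))\right\} = (t(x) - \tau(\phi(\beta)))^\top J(\beta)$$

and further, differentiating once more and changing sign,

$$I(x, \beta)_{ij} = -\frac{\partial S(x, \beta)_i}{\partial\beta_j} = -\sum_u \frac{\partial^2\phi_u(\beta)}{\partial\beta_i \partial\beta_j} \left(t_u(x) - \tau_u(\phi(\beta))\right) + \sum_{u,v} \frac{\partial\phi_u(\beta)}{\partial\beta_i}\, \kappa(\phi(\beta))_{uv}\, \frac{\partial\phi_v(\beta)}{\partial\beta_j}$$

or in a more compact form

$$I(x, \beta) = J(\beta)^\top \kappa(\phi(\beta)) J(\beta) - \sum_u D^2\phi(\beta)_u \left(t_u(x) - \tau_u(\phi(\beta))\right).$$

We get the Fisher information by taking expectations and noticing that the last term
has expectation zero whence

$$i(\beta) = J(\beta)^\top \kappa(\phi(\beta)) J(\beta). \qquad (3.11)$$

The expression for the quadratic score now follows.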
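
The formula $i(\beta) = J(\beta)^\top \kappa(\phi(\beta)) J(\beta)$ can be evaluated explicitly for the normal model with fixed coefficient of variation, $X \sim N(\beta, \beta^2)$. The sketch below assumes the map $\phi(\beta) = (1/\beta, 1/\beta^2)^\top$ into the canonical parameters of the normal family and takes $\kappa$ as the Hessian of the cumulant function $\psi(\theta) = \theta_1^2/(2\theta_2) + \frac{1}{2}\log(2\pi) - \frac{1}{2}\log\theta_2$, worked out by hand. The function names are illustrative only.

```python
def phi(beta):
    """Canonical parameter (theta1, theta2) = (1/beta, 1/beta^2) of N(beta, beta^2)."""
    return (1 / beta, 1 / beta**2)

def jacobian(beta):
    """Derivative of phi with respect to beta."""
    return (-1 / beta**2, -2 / beta**3)

def kappa(theta1, theta2):
    """Hessian of psi(theta) = theta1^2/(2 theta2) + log(2 pi)/2 - log(theta2)/2."""
    k11 = 1 / theta2
    k12 = -theta1 / theta2**2
    k22 = theta1**2 / theta2**3 + 1 / (2 * theta2**2)
    return ((k11, k12), (k12, k22))

def fisher_information(beta):
    """Quadratic form J^T kappa(phi(beta)) J for the one-dimensional curved family."""
    J = jacobian(beta)
    K = kappa(*phi(beta))
    return sum(J[u] * K[u][v] * J[v] for u in range(2) for v in range(2))

for beta in (0.5, 1.0, 2.0):
    print(beta, fisher_information(beta), 3 / beta**2)
```

In this sketch the quadratic form simplifies to $3/\beta^2$, which the printed values confirm for the three hypothetical choices of $\beta$.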


Remark 3.32. Note the analogy between the expression (3.11) and the correspond- 
ing expression for the linear normal model in (2.10). 


3.7. Exercises 


Exercise 3.1. Let $X$ and $Y$ be independent and exponentially distributed random
variables with $E(X) = \mu$ and $E(Y) = 2\mu$, where $\mu > 0$. Represent the family
of joint distributions of $(X, Y)$ as a one-dimensional regular exponential family, and
identify the base measure, the canonical parameter, canonical statistic, and cumulant
function.

Exercise 3.2. Consider the family of negative binomial distributions, i.e. distributions
of a random variable $X$ with densities

\[ P(X = x) = \binom{x + r - 1}{x} (1 - \mu)^r \mu^x \]

with respect to counting measure on $\mathbb{N}_0 = \{0, 1, \ldots\}$. Here $r \in \mathbb{N}$ is considered fixed
and known, whereas $\mu \in (0, 1)$ is unknown.


a) Represent this family as a regular exponential family of dimension 1 and identify
base measure, canonical parameter, canonical statistic, and cumulant function.


b) Find the mean $E_\mu(X)$ and the variance $V_\mu(X)$ in the distribution.


Exercise 3.3. Consider the family of Pareto distributions with densities

\[ f_\alpha(x) = \alpha x^{-\alpha - 1}\, 1_{(1,\infty)}(x) \]

with respect to standard Lebesgue measure on $\mathbb{R}$, where $\alpha > 0$ is unknown. Represent
this family as a regular exponential family of dimension 1 and identify base
measure, canonical parameter, canonical statistic, and cumulant function.

Exercise 3.4. The inverse normal distribution has density

\[ f_{\mu,\lambda}(x) = \sqrt{\frac{\lambda}{2\pi x^3}}\, \exp\left( -\frac{\lambda (x - \mu)^2}{2\mu^2 x} \right) \]

with respect to standard Lebesgue measure on $\mathbb{R}_+$. You may without proof assume
that $\int_0^\infty f_{\mu,\lambda}(x)\, dx = 1$ for all $(\mu, \lambda) \in \mathbb{R}_+^2$. Let now

\[ \mathcal{P} = \{P_{\mu,\lambda},\ \mu > 0,\ \lambda > 0\} \]

where $P_{\mu,\lambda}$ has density function $f_{\mu,\lambda}$ as above.


a) Represent the family P as an exponential family of dimension 2 and identify the 
base measure, canonical parameter, canonical statistic, and cumulant function. 


b) Show that the mean and variance in the family are given as

\[ E_{\mu,\lambda}(X) = \mu, \qquad V_{\mu,\lambda}(X) = \frac{\mu^3}{\lambda}. \]

Now consider the subfamily of inverse normal distributions where the mean is equal
to the variance, i.e. where $\lambda = \mu^2$. These are known as the standard inverse normal
distributions.


c) Argue that this family is an affine subfamily of the full family and thus forms a 
regular exponential family of dimension 1; 


d) Identify the base measure, canonical parameter, canonical statistic, and cumulant 
function in this subfamily. 


e) Determine the mean and variance of $Y = X^{-1}$.


Exercise 3.5. Let $(X, Y)$ be random variables taking values in $\mathbb{N}_0 \times \mathbb{R}_+$ with a
distribution determined as follows: $X$ is drawn from a Poisson distribution with mean
$\lambda$ yielding the value $x$ and subsequently $Y$ is drawn from a gamma distribution with
shape parameter $x + 1$ and scale parameter $\beta$. In other words, the joint distribution
of $(X, Y)$ has density

\[ f_{(\lambda,\beta)}(x, y) = \frac{\lambda^x y^x}{(x!)^2\, \beta^{x+1}}\, e^{-\lambda - y/\beta}, \qquad y > 0,\ x = 0, 1, \ldots \]

with respect to $m \times \nu$, where $m$ is counting measure on $\mathbb{N}_0$ and $\nu$ is the standard
Lebesgue measure on $\mathbb{R}_+$.


a) Argue that the family of distributions with unknown $(\lambda, \beta) \in \mathbb{R}_+^2$ may be represented
as a minimal and regular two-dimensional exponential family, determine
the canonical parameters, the canonical parameter space, associated canonical
statistics, and cumulant function.


b) Find the mean and covariance matrix for $(X, Y)^\top$.


c) Show that the subfamily given by the restriction $\lambda = \beta$ is a minimal and regular
one-dimensional family and determine the canonical parameters, associated
canonical statistics, and cumulant function.

Exercise 3.6. Let $(X, Y)$ be random variables taking values in $\mathbb{R}_+ \times \mathbb{N}_0$ with a
distribution determined as follows: $X$ is drawn from an exponential distribution with
mean $\lambda$ yielding the value $x$, and subsequently $Y$ is drawn from a Poisson distribution
with mean $\beta x$. In other words, the joint distribution of $(X, Y)$ has density

\[ f_{(\lambda,\beta)}(x, y) = \frac{1}{\lambda}\, e^{-x/\lambda}\, \frac{(\beta x)^y}{y!}\, e^{-\beta x}, \qquad x > 0,\ y = 0, 1, \ldots \]

with respect to $\nu \times m$, where $\nu$ is the standard Lebesgue measure on $\mathbb{R}_+$ and $m$ is
counting measure on $\mathbb{N}_0$.


a) Argue that the family of distributions with unknown $(\lambda, \beta) \in \mathbb{R}_+^2$ may be represented
as a minimal and regular two-dimensional exponential family, determine
the canonical parameters, the canonical parameter space, associated canonical
statistics, and cumulant function.


b) Find the mean and covariance matrix for $(X, Y)^\top$.


c) Show that the subfamily given by the restriction $\lambda = \beta$ is a curved exponential
family of dimension one and order two.


d) Find the log-likelihood function, score function, Fisher information, and quadratic
score for $\beta$ in this subfamily.


Exercise 3.7. Consider the family of gamma distributions with identical parameters
for shape and scale, i.e. with densities

\[ f_\beta(x) = \frac{x^{\beta - 1} e^{-x/\beta}}{\Gamma(\beta)\, \beta^\beta}, \qquad \beta \in \mathbb{R}_+, \]

with respect to standard Lebesgue measure on $\mathbb{R}_+$.


a) Represent this family as a curved exponential family of dimension one and order
two.

b) Find the log-likelihood function, score function, Fisher information, and quadratic
score for the family.

Exercise 3.8. Let $X$ and $Y$ be independent random variables with $X$ Poisson distributed
with mean $\lambda$ and $Y$ exponentially distributed with rate $\lambda$, where $\lambda > 0$.

a) Represent the family of joint distributions of (X,Y) as a curved exponential 
family of dimension one and order two. 

b) Find the log-likelihood function, score function, Fisher information, and quad- 
ratic score for the family. 

Exercise 3.9. Let $X$ and $Y$ be independent and exponentially distributed random
variables with $E(X) = \beta$ and $E(Y) = 1/\beta$ where $\beta > 0$.

a) Represent the family of joint distributions of $(X, Y)$ as a curved exponential family
of dimension one and order two.


b) Find the log-likelihood function, score function, Fisher information, and quadratic
score for the family.

Exercise 3.10. Let $(X_1, Y_1, \ldots, X_n, Y_n)$ be independent and exponentially distributed
random variables where $X_i$ and $Y_i$ have densities with respect to standard
Lebesgue measure:


fief a SO) ap) Ae SO, Pada 


and $(\theta, \lambda) \in \mathbb{R}_+^2$ both unknown. Note that $Y_1, \ldots, Y_n$ are identically distributed
whereas this is not the case for $X_1, \ldots, X_n$.

a) Argue that this specifies a minimal and regular exponential family of dimension 
two, identify the base measure, canonical statistic, and cumulant function. 


b) Consider the subfamily determined by the restriction $(\theta, \lambda) = (\beta, 1/\beta)$ and show
that this is a curved exponential family of dimension one and order two.


c) Find the log-likelihood function, score function, Fisher information, and quad- 
ratic score in this subfamily. 
Chapter 4


Estimation 


4.1 General concepts and exact properties 


An estimator is a measurable function $t : \mathcal{X} \to \Theta$ from the sample space to the
parameter space; the value $t(x)$ of the estimator is an estimate based on data $x$. The
estimator represents a guess on the value $\theta$ when we assume that $x$ represents an
observed outcome $X = x$ of the random variable $X$ having distribution $P_\theta \in \mathcal{P}$. We
often write
\[ \hat\theta = \hat\theta(X) = t(X) \]
emphasizing the fact that also the estimate $\hat\theta$ is a random variable, assuming that $\Theta$
has been equipped with a measurable structure so that $(\Theta, \mathbb{T})$ is a measurable space.
In most of this book we consider the case where the parameter space is a subset
$\Theta \subseteq \mathbb{R}^k$ for some $k$, so $\Theta$ inherits the Borel $\sigma$-algebra from $\mathbb{R}^k$.

Also, we may in general not be interested in guessing $\theta$, but only a specific parameter
function of interest $\phi : \Theta \to \Lambda \subseteq \mathbb{R}^d$; so our estimator will be of the form

\[ \hat\lambda = \phi(\hat\theta) = t(X), \]

where $t : \mathcal{X} \to \Lambda$. For a number of reasons, it is convenient to allow the estimator to
take values outside $\Lambda$, so we define formally:


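Definition 4.1. Consider a parametrized statistical model on the representation space
$(\mathcal{X}, \mathbb{E})$ with associated family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ and a parametric function
$\phi : \Theta \to \Lambda \subseteq \mathbb{R}^d$. An estimator of $\lambda$ is a measurable map $t : \mathcal{X} \to \mathbb{R}^d$. We say that the
estimator is well-defined on $B = \{x : t(x) \in \Lambda\} \subseteq \mathcal{X}$ and the set $A = \mathcal{X} \setminus B =
t^{-1}(\mathbb{R}^d \setminus \Lambda)$ is the exceptional set of the estimator.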


Clearly we would wish $A$ to have small probability under $P_\theta$ as our guess on $\lambda$
would otherwise mostly be very bad. But we would also be more ambitious and wish
to ensure that $\hat\lambda$ is a 'good guess' and therefore not too far from the value $\lambda = \phi(\theta)$
corresponding to $P_\theta$. In other words, we would wish that the distribution $t(P_\theta)$ is
concentrated around $\phi(\theta)$.

One way of measuring this concentration is the mean square error (MSE) which
is defined as

\[ \mathrm{MSE}_\theta(\hat\lambda) = E_\theta(\|\hat\lambda - \lambda\|^2) = E_\theta(\|t(X) - \lambda\|^2) \]

where $\hat\lambda = t(X)$. We would also in general worry if our estimator was 'systematically
wrong' in some way and introduce the notion of bias of an estimator as the
difference between its mean and the value of the parameter it is supposed to estimate:

\[ B_\theta(\hat\lambda) = E_\theta(\hat\lambda - \lambda) = E_\theta(t(X)) - \lambda \]

and note that we then have

\[ \mathrm{MSE}_\theta(\hat\lambda) = E_\theta(\|\hat\lambda - \lambda\|^2) = E_\theta(\|\hat\lambda - E_\theta(\hat\lambda)\|^2) + \|E_\theta(\hat\lambda) - \lambda\|^2 = \mathrm{tr}(V_\theta(\hat\lambda)) + \|B_\theta(\hat\lambda)\|^2, \]

i.e. the MSE of an estimator is the sum of the trace of its variance and the squared
norm of its bias. Here the trace of a symmetric matrix $A$ is $\mathrm{tr}(A) = \sum_i a_{ii}$, i.e. the
sum of its diagonal elements, see Section A.5 for further details.

If $\lambda \in \mathbb{R}$ is one-dimensional, it is customary to use the term standard error for
the square root of the variance

\[ \mathrm{SE}_\theta(\hat\lambda) = \sqrt{V_\theta(\hat\lambda)} \]

so we then have

\[ \mathrm{MSE}_\theta(\hat\lambda) = \mathrm{SE}_\theta(\hat\lambda)^2 + B_\theta(\hat\lambda)^2. \]

Beware the difference between standard deviation and standard error; the first is
used to measure the spread of the distribution of $X$, whereas the second is used
for the variability of the estimate of a parameter. The standard deviation of $X$ when
$X \sim N(\xi, \sigma^2)$ is $\sigma$, whereas the standard error of the mean $\bar X_n = (X_1 + \cdots + X_n)/n$
is then

\[ \mathrm{SE}_{\xi,\sigma^2}(\bar X_n) = \frac{\sigma}{\sqrt{n}}. \]


Ideally, when constructing an estimator, we would wish to reduce both bias and
variance, but this is not always possible; if bias is minimized, variance will often
increase and vice versa so there could be a trade-off. An estimator is said to be
unbiased if $B_\theta(\hat\lambda) = 0$ for all $\theta \in \Theta$.

Example 4.2. [Uniform distribution] Consider the problem of estimating the unknown
parameter in the uniform distribution on $(0, \theta)$ where $\theta \in \Theta = \mathbb{R}_+$, corresponding
to the model in Example 1.8. We could consider the estimators

\[ \tilde\theta_n = \frac{2}{n}(X_1 + \cdots + X_n), \qquad \hat\theta_n = \max(X_1, \ldots, X_n), \]

where the rationale for the first estimator $\tilde\theta_n$ is that the expectation of a single observation
is $E_\theta(X_i) = \theta/2$, and the rationale for the second estimator $\hat\theta_n$ is that the
largest value will be close to the true value. We first note that the estimator $\tilde\theta_n$ is
unbiased, i.e.

\[ E_\theta(\tilde\theta_n) = \frac{2}{n} \cdot n \cdot \frac{\theta}{2} = \theta, \]



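so the MSE is equal to its variance

\[ \mathrm{MSE}_\theta(\tilde\theta_n) = V_\theta(\tilde\theta_n) = \frac{4}{n^2} \cdot n \cdot \frac{\theta^2}{12} = \frac{\theta^2}{3n}, \tag{4.1} \]

where we have used that $V_\theta(X_i) = V(\theta U) = \theta^2 V(U)$ where $U$ is uniform on $(0, 1)$
and

\[ V(U) = E(U^2) - E(U)^2 = \int_0^1 u^2\, du - \left( \int_0^1 u\, du \right)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}. \]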

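Considering the second estimator, we first find its distribution function, exploiting
that the maximum is smaller than $x$ if and only if all of the variables are; so that
for $x \in [0, \theta]$ we have

\[ G_\theta(x) = P_\theta\{\hat\theta_n \le x\} = \prod_{i=1}^n P_\theta\{X_i \le x\} = \frac{x^n}{\theta^n}, \]

yielding the density by differentiation

\[ g_\theta(x) = \frac{n x^{n-1}}{\theta^n}\, 1_{(0,\theta)}(x). \]

We may now find the first two moments of the distribution:

\[ E_\theta(\hat\theta_n) = \int_0^\theta \frac{n x^n}{\theta^n}\, dx = \frac{n}{n+1}\,\theta, \qquad E_\theta(\hat\theta_n^2) = \int_0^\theta \frac{n x^{n+1}}{\theta^n}\, dx = \frac{n}{n+2}\,\theta^2, \]

yielding the variance

\[ V_\theta(\hat\theta_n) = \frac{n}{n+2}\,\theta^2 - \left( \frac{n}{n+1}\,\theta \right)^2 = \frac{n\theta^2}{(n+2)(n+1)^2}. \]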

In contrast to the unbiased estimator $\tilde\theta_n$, the estimator $\hat\theta_n$ has a negative bias

\[ B_\theta(\hat\theta_n) = \frac{n}{n+1}\,\theta - \theta = -\frac{\theta}{n+1}, \]

reflecting that the estimator is systematically too small. However, the MSE of $\hat\theta_n$ is

\[ \mathrm{MSE}_\theta(\hat\theta_n) = V_\theta(\hat\theta_n) + B_\theta(\hat\theta_n)^2 = \frac{n\theta^2}{(n+2)(n+1)^2} + \frac{\theta^2}{(n+1)^2} = \frac{2\theta^2}{(n+1)(n+2)}, \tag{4.2} \]

which approaches zero at the speed of $n^{-2}$, which is much faster than the MSE of $\tilde\theta_n$,
approaching zero at the speed of $n^{-1}$. The estimator $\hat\theta_n$ is biased, but it can be bias
corrected by letting

\[ \check\theta_n = \frac{n+1}{n}\,\hat\theta_n = \frac{n+1}{n}\,\max(X_1, \ldots, X_n) \]


Figure 4.1 — Estimation of the range of a uniform distribution. The diagrams compare 
the three estimators in Example 4.2 for 5000 simulations of n = 10 observations (top 
row) and n = 100 observations (bottom row). 


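which is now unbiased and therefore has MSE equal to its variance

\[ \mathrm{MSE}_\theta(\check\theta_n) = V_\theta(\check\theta_n) = \frac{(n+1)^2}{n^2} \cdot \frac{n\theta^2}{(n+2)(n+1)^2} = \frac{\theta^2}{n(n+2)}, \]

which may be considerably smaller than both $\mathrm{MSE}_\theta(\tilde\theta_n)$ as given in (4.1) and
$\mathrm{MSE}_\theta(\hat\theta_n)$ as given in (4.2). Thus, the estimator $\check\theta_n$ appears preferable. These facts
are illustrated in Fig. 4.1 where the estimators have been applied to simulated data
with n = 10 and n = 100 observations.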
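
The comparison in Example 4.2 can be checked numerically. The following is a minimal simulation sketch, not the code behind Fig. 4.1: the function names, the seed and the choice of 2 as true parameter value are assumptions made for illustration. It estimates the MSE of twice the sample mean, of the sample maximum, and of the bias-corrected maximum, and compares with the closed forms obtained from the variances above and the bias $-\theta/(n+1)$ of the maximum.

```python
import random


def simulate_uniform_estimators(theta=2.0, n=10, reps=5000, seed=1):
    """Monte Carlo MSE of three estimators of theta in Uniform(0, theta)."""
    rng = random.Random(seed)  # fixed seed so that runs are reproducible
    sq_err = {"moment": 0.0, "max": 0.0, "corrected": 0.0}
    for _ in range(reps):
        x = [rng.uniform(0.0, theta) for _ in range(n)]
        estimates = {
            "moment": 2.0 * sum(x) / n,         # twice the sample mean (unbiased)
            "max": max(x),                      # sample maximum (biased downwards)
            "corrected": (n + 1) / n * max(x),  # bias-corrected maximum
        }
        for name, value in estimates.items():
            sq_err[name] += (value - theta) ** 2
    return {name: total / reps for name, total in sq_err.items()}


def theoretical_mse(theta, n):
    """Closed-form MSEs: variance plus squared bias for each estimator."""
    return {
        "moment": theta**2 / (3 * n),
        "max": 2 * theta**2 / ((n + 1) * (n + 2)),
        "corrected": theta**2 / (n * (n + 2)),
    }


if __name__ == "__main__":
    for n in (10, 100):
        print(n, simulate_uniform_estimators(n=n), theoretical_mse(2.0, n))
```

With a fixed seed the simulated values should lie close to the theoretical ones, and the ordering of the three estimators is visible already for n = 10.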


We should also note that the notion of unbiasedness is not equivariant, as the 
following example illustrates. 


Example 4.3. [Estimating the variance in a normal distribution] Consider a sample
$X = (X_1, \ldots, X_n)$ from a normal distribution $N(\xi, \sigma^2)$, where $\xi \in \mathbb{R}$ and $\sigma^2 \in \mathbb{R}_+$
are both unknown. If our parameter of interest is $\phi(\xi, \sigma^2) = \sigma^2$, it is customary to
estimate $\sigma^2$ as

\[ \hat\sigma^2 = \frac{\|X - \bar X\|^2}{n - 1} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2 \]

and since $Y = \|X - \bar X\|^2$ has a $\sigma^2\chi^2$-distribution with $n - 1$ degrees of freedom
(see Example 2.26), we have

\[ E_{\xi,\sigma^2}(\hat\sigma^2) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2 \]

so $\hat\sigma^2$ is indeed an unbiased estimator of $\sigma^2$. However, maybe we really would rather
estimate the standard deviation $\sigma$ and would then typically do so by taking the square
root of the above estimate. Unfortunately, the estimate for $\sigma$ is not unbiased as

\[ E_{\xi,\sigma^2}(\hat\sigma) = \frac{E_{\xi,\sigma^2}(\sqrt{Y})}{\sqrt{n-1}} < \frac{\sqrt{E_{\xi,\sigma^2}(Y)}}{\sqrt{n-1}} = \sigma \]

where we have used that for any positive random variable $Z$ we have $(E(\sqrt{Z}))^2 \le
E((\sqrt{Z})^2) = E(Z)$ with equality if and only if $Z$ is almost surely constant. See also
Exercise 4.2 for further aspects of this issue.
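
The size of this bias can be made explicit. The sketch below relies on a standard fact that is not derived in the text, namely that a $\chi^2$-distributed variable $W$ with $k$ degrees of freedom has $E(\sqrt{W}) = \sqrt{2}\,\Gamma((k+1)/2)/\Gamma(k/2)$; the function name is hypothetical. Under that assumption, the ratio between the expected value of the square root of the unbiased variance estimate and $\sigma$ is:

```python
import math


def sd_bias_factor(n):
    """E(sigma_hat) / sigma for a normal sample of size n >= 2.

    Uses E(sqrt(W)) = sqrt(2) * Gamma((k + 1) / 2) / Gamma(k / 2) for W
    chi-square with k = n - 1 degrees of freedom (a standard fact, assumed here).
    """
    log_ratio = math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
    return math.sqrt(2.0 / (n - 1)) * math.exp(log_ratio)


if __name__ == "__main__":
    for n in (2, 5, 10, 100):
        print(n, sd_bias_factor(n))
```

The factor is strictly below one for every sample size, in line with the inequality above, and it approaches one as n grows, so the bias matters mainly for small samples.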


An unbiased estimator that has minimum variance among all unbiased estimators 
is an MVUE (Minimum Variance Unbiased Estimator). Such estimators are usually 
considered attractive; however, in many examples there is only a single unbiased 
estimator and the property may then be less impressive. 

Sometimes it may be of interest to limit the type of functions g when constructing 
estimators, e.g. to functions that are linear. If an estimator has minimal variance 
among all linear unbiased estimators it is a best linear unbiased estimator (BLUE). 

The Fisher information plays an important role in the sense that it gives a
lower bound on the variance term for an estimator. This is known as the Cramér–Rao
inequality. We first give this in the simplest case of a one-dimensional parameter.


Theorem 4.4 (Cramér–Rao inequality). Consider a statistical model with a smooth
and stable family of distributions with parameter space $\Theta \subseteq \mathbb{R}$ and let $\phi : \Theta \to \mathbb{R}$
be a smooth parameter function. Further, let $\hat\lambda = t(X)$ be an estimator of $\lambda = \phi(\theta)$
with expectation $E_\theta(\hat\lambda) = g(\theta)$ where $g$ is smooth. Suppose that every $\theta$ has a
neighbourhood $U_\theta$ so that for all $\eta \in U_\theta$ we have

\[ |f_\eta(x) - f_\theta(x)| \le f_\theta(x)\, |\eta - \theta|\, H_\theta(x) \tag{4.3} \]

where $E_\theta(H_\theta(X)^2) < \infty$. Then

\[ V_\theta(\hat\lambda) \ge \frac{g'(\theta)^2}{i(\theta)}. \tag{4.4} \]


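Proof. Consider an arbitrary $\eta \in \Theta$ and the random variable

\[ Q_{\eta,\theta} = f_\eta(X)/f_\theta(X) = L_X(\eta)/L_X(\theta), \]

i.e. the ratio of the likelihood functions at $\eta$ and $\theta$. We then have

\[ E_\theta(Q_{\eta,\theta}) = \int \frac{f_\eta(x)}{f_\theta(x)}\, f_\theta(x)\, d\mu(x) = \int f_\eta(x)\, d\mu(x) = 1 \]

and thus the covariance between $t(X)$ and $Q_{\eta,\theta}$ is

\[ \mathbb{V}_\theta(t(X), Q_{\eta,\theta}) = \int t(x) \left( \frac{f_\eta(x)}{f_\theta(x)} - 1 \right) f_\theta(x)\, d\mu(x) = g(\eta) - g(\theta). \]

By the Cauchy–Schwarz inequality, we now get

\[ (g(\eta) - g(\theta))^2 \le V_\theta(t(X))\, V_\theta(Q_{\eta,\theta}). \]

Now divide both sides of the inequality by $(\eta - \theta)^2$ and let $\eta \to \theta$. Dominated
convergence, using the regularity condition (4.3), yields

\[ \frac{V_\theta(Q_{\eta,\theta})}{(\eta - \theta)^2} = \int \left( \frac{f_\eta(x) - f_\theta(x)}{(\eta - \theta)\, f_\theta(x)} \right)^2 f_\theta(x)\, d\mu(x) \to E_\theta(S(X, \theta)^2) = V_\theta(S(X, \theta)) \]

and hence the limit

\[ g'(\theta)^2 \le V_\theta(\hat\lambda)\, V_\theta(S(X, \theta)) = V_\theta(\hat\lambda)\, i(\theta). \]

Dividing both sides of the inequality by $i(\theta)$ yields (4.4) as desired.
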
The uniform distribution in Example 4.2 does not satisfy the conditions for the 
Cramér—Rao inequality to hold. However, we have 


Proposition 4.5. In a curved or regular exponential family, the conditions in The- 
orem 4.4 are satisfied. 


Proof. See Appendix B.1.2. 


Remark 4.6. In the special case where $\lambda = \theta$ and $\hat\theta$ is an unbiased estimator so that
$g(\theta) = E_\theta(\hat\theta) = \theta$ and thus $g'(\theta) = 1$, we get the simple Cramér–Rao inequality:

\[ V_\theta(\hat\theta) \ge i(\theta)^{-1}, \tag{4.5} \]

i.e. the inverse Fisher information is a lower bound for the variance of any unbiased
estimator.


This implies the following:

Corollary 4.7. If an unbiased estimator satisfies $V_\theta(\hat\theta) = i(\theta)^{-1}$, it must be an
MVUE.

An MVUE with a variance that is optimal in the sense that it attains the Cramér–Rao
bound $V_\theta(\hat\theta) = i(\theta)^{-1}$ is said to be efficient.

Example 4.8. [Estimation in the exponential distribution] Consider estimation of the
mean $\theta$ in an exponential distribution

\[ f_\theta(x) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad x > 0, \]

based on independent and identically distributed observations $X_1, \ldots, X_n$, and let
us for simplicity assume that $n = 2m + 1$ is odd. We may consider the following
three estimators

\[ \hat\theta_n = \frac{X_1 + \cdots + X_n}{n}, \qquad \tilde\theta_n = \frac{\mathrm{med}(X_1, \ldots, X_n)}{k(m)}, \qquad \check\theta_n = n \min(X_1, \ldots, X_n), \]

where

\[ k(m) = \sum_{j=1}^{m+1} \frac{1}{2m - j + 2}. \]

We shall first argue that the estimators are all unbiased.

This is obvious for the first estimator, $\hat\theta_n$, since $E_\theta(X_i) = \theta$. To show that the
second estimator is unbiased, we use a classical result by Rényi (1953), saying that
the $p$-th order statistic $X_{(p)}$ from $n$ independent and identically exponentially distributed
random variables has the same distribution as a weighted sum of independent
exponential random variables $Z_1, \ldots, Z_n$ with expectation $E(Z_j) = 1$ as

\[ X_{(p)} \overset{d}{=} \theta \sum_{j=1}^{p} \frac{Z_j}{n - j + 1}. \]

It follows that the mean and variance of the sample median of an odd number $n =
2m + 1$ of exponentially distributed random variables are

\[ E_\theta(\mathrm{med}(X_1, \ldots, X_n)) = \theta\, k(m), \qquad V_\theta(\mathrm{med}(X_1, \ldots, X_n)) = \theta^2\, v(m), \]

where

\[ v(m) = \sum_{j=1}^{m+1} \frac{1}{(2m - j + 2)^2}, \]

and hence $\tilde\theta_n$ is unbiased. For the last estimator, we use that

\[ P_\theta\{\min(X_1, \ldots, X_n) > x\} = \prod_{i=1}^n P_\theta\{X_i > x\} = e^{-nx/\theta}, \]

showing that $Y_n = \min(X_1, \ldots, X_n)$ is again exponentially distributed but with
mean $\theta/n$, and hence

\[ E_\theta(\check\theta_n) = E_\theta(nY_n) = n \cdot \frac{\theta}{n} = \theta \]

so $\check\theta_n$ is also an unbiased estimator.
Since all three estimators are unbiased, it makes sense to compare their variances.
We have

\[ V_\theta(\hat\theta_n) = \frac{\theta^2}{n}, \qquad V_\theta(\tilde\theta_n) = \theta^2\, \frac{v(m)}{k(m)^2}, \qquad V_\theta(\check\theta_n) = n^2 \left( \frac{\theta}{n} \right)^2 = \theta^2. \]

Clearly, the last estimator is really bad, as the variance does not even decrease with
$n$. It may not be immediately obvious that $\tilde\theta_n$ has a larger variance than $\hat\theta_n$, i.e. that

\[ \frac{v(m)}{k(m)^2} \ge \frac{1}{2m + 1}, \]

but this is actually the case. Indeed we calculated the Fisher information for $\theta$ in
Example 1.32 for a single observation to be $i(\theta) = \theta^{-2}$, and hence for $n$ observations,
the Fisher information is $i_n(\theta) = n\, i(\theta) = n\theta^{-2}$; thus the lower bound in the
Cramér–Rao inequality (4.5) is $\theta^2/n$, which is attained by $\hat\theta_n$.

In other words, $\hat\theta_n$ is an efficient estimator and an MVUE by Corollary 4.7 and
therefore its variance and mean square error are at least as small as those of any other
unbiased estimator.

To identify how much is actually gained by using $\hat\theta_n$ instead of $\tilde\theta_n$ for $n$ large,
we may find bounds for $k(m)$ and $v(m)$ as

\[ \log 2 = \int_{m+1}^{2m+2} \frac{1}{x}\, dx < k(m) < \int_{m}^{2m+1} \frac{1}{x}\, dx = \log \frac{2m+1}{m}, \]

\[ \frac{1}{2(m+1)} = \int_{m+1}^{2m+2} \frac{1}{x^2}\, dx < v(m) < \int_{m}^{2m+1} \frac{1}{x^2}\, dx = \frac{m+1}{m(2m+1)}, \]

where we have compared the sums defining $k(m)$ and $v(m)$ to Riemann sums for
the corresponding integrals. We have thus

\[ \mathrm{reff}(\tilde\theta_n, \hat\theta_n) = \lim_{n \to \infty} \frac{V_\theta(\tilde\theta_n)}{V_\theta(\hat\theta_n)} = \lim_{m \to \infty} \frac{(2m+1)\, v(m)}{k(m)^2} = \frac{1}{(\log 2)^2} \approx 2.08, \]

implying that we need about twice as many observations using $\tilde\theta_n$ instead of $\hat\theta_n$ to
get the same precision for our estimate. The quantity $\mathrm{reff}(\tilde\theta_n, \hat\theta_n)$ calculated above is
known as the asymptotic relative efficiency of $\hat\theta_n$ to $\tilde\theta_n$.

The findings above are illustrated in Fig. 4.2 where the estimators have been
applied to simulated data with n = 11 and n = 51 observations.
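
The sums $k(m)$ and $v(m)$ and the resulting variance ratio are easy to evaluate directly. The following sketch uses hypothetical function names and simply transcribes the sums over $j = 1, \ldots, m+1$ of $1/(2m - j + 2)$ and of its square, so that the integral bounds and the limit $1/(\log 2)^2$ can be inspected numerically.

```python
import math


def k(m):
    """Sum over j = 1..m+1 of 1 / (2m - j + 2); the median has mean theta * k(m)."""
    return sum(1.0 / (2 * m - j + 2) for j in range(1, m + 2))


def v(m):
    """Sum over j = 1..m+1 of 1 / (2m - j + 2)^2; the median has variance theta^2 * v(m)."""
    return sum(1.0 / (2 * m - j + 2) ** 2 for j in range(1, m + 2))


def variance_ratio(m):
    """Variance of the median-based estimator divided by that of the mean, n = 2m + 1."""
    return (2 * m + 1) * v(m) / k(m) ** 2


if __name__ == "__main__":
    for m in (5, 25, 1000):
        print(2 * m + 1, k(m), v(m), variance_ratio(m))
    print("limit:", 1.0 / math.log(2) ** 2)
```

For a single observation (m = 0) the ratio equals one, as the median and the mean then coincide; for larger m it is above one and moves towards the limiting value.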


For the sake of completeness, we also give Cramér's version of the theorem with
a slightly different regularity condition, which in contrast to the above involves the
estimator $t$ itself. The proof is simpler, but the conclusion of the theorem is also
weaker.

Theorem 4.9 (Cramér's inequality). Consider a statistical model with a smooth and
stable family of distributions with parameter space $\Theta \subseteq \mathbb{R}$, let $\phi : \Theta \to \mathbb{R}$ be a
smooth parameter function and $\hat\lambda = t(X)$ an estimator of $\lambda = \phi(\theta)$ with expectation
$E_\theta(\hat\lambda) = \gamma(\theta)$ and assume

\[ \int |t(x)|\, g_\theta(x)\, \mu(dx) < \infty \tag{4.6} \]

where $g_\theta$ is $\mu$-integrable and dominates $\partial f_\eta(x)/\partial\eta$ for $\eta \in U_\theta$, where $U_\theta$ is a
neighbourhood of $\theta$. Then $\gamma$ is differentiable and

\[ V_\theta(\hat\lambda) \ge \frac{\gamma'(\theta)^2}{i(\theta)}. \]

Proof. The condition (4.6) ensures $\gamma$ is differentiable with derivative found by
differentiation under the integral sign so that

\[ \gamma'(\theta) = \int t(x)\, \frac{\partial f_\theta(x)}{\partial \theta}\, \mu(dx) = E_\theta(t(X)\, S(X, \theta)) = E_\theta((t(X) - \gamma(\theta))\, S(X, \theta)) \]

where we have exploited that $E_\theta(S(X, \theta)) = 0$. Now the Cauchy–Schwarz inequality
yields

\[ \gamma'(\theta)^2 \le V_\theta(t(X))\, V_\theta(S(X, \theta)) = V_\theta(\hat\lambda)\, i(\theta) \]

and the inequality follows.


Figure 4.2 – Estimation of the mean of an exponential distribution. The diagrams compare
the three estimators in Example 4.8 for 5000 simulations of n = 11 observations
(top row) and n = 51 observations (bottom row).


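The Cramér–Rao inequality also has a multivariate version.


Corollary 4.10 (Multivariate Cramér–Rao). Consider a statistical model with a
smooth and stable family of distributions with parameter space $\Theta \subseteq \mathbb{R}^k$, let
$\phi : \Theta \to \mathbb{R}^d$ be a smooth parameter function, and $\hat\lambda = t(X)$ an estimator of $\lambda$
with expectation $E_\theta(\hat\lambda) = g(\theta)$, where $g$ is smooth. Suppose further that every $\theta$ has
a neighbourhood $U_\theta$ so that for all $\eta \in U_\theta$ we have

\[ |f_\eta(x) - f_\theta(x)| \le f_\theta(x)\, H_\theta(x)\, \|\eta - \theta\| \tag{4.7} \]

where $E_\theta(H_\theta(X)^2) < \infty$. Then its covariance matrix $V_\theta(\hat\lambda)$ satisfies

\[ V_\theta(\hat\lambda) - Dg(\theta)\, i(\theta)^{-1} Dg(\theta)^\top \text{ is positive semidefinite,} \tag{4.8} \]

where $Dg(\theta)$ is the Jacobian matrix with entries $\partial g_i(\theta)/\partial\theta_j$. In other words, it holds
for all $u \in \mathbb{R}^d$ that

\[ u^\top V_\theta(\hat\lambda)\, u \ge u^\top Dg(\theta)\, i(\theta)^{-1} Dg(\theta)^\top u. \tag{4.9} \]

Proof. We show this by reducing the statement to the one-dimensional case. So
choose $u \in \mathbb{R}^d$ and $v \in \mathbb{R}^k$ and consider the estimator $t_u(X) = u^\top t(X)$ of $u^\top \lambda$.
Consider further, for any fixed $\theta \in \Theta$, the subfamily determined by

\[ \Theta_v = \Theta \cap \{\theta + \delta v,\ \delta \in \mathbb{R}\}. \]

We then have

\[ E_{\theta + \delta v}(t_u(X)) = h_u(\delta) = u^\top g(\theta + \delta v) \]

and by composite differentiation

\[ h_u'(\delta) = u^\top Dg(\theta + \delta v)\, v. \]

Further, for the score and information $\tilde S$ and $\tilde i$ in this submodel we have

\[ \tilde S(\delta, X) = S(X, \theta + \delta v)\, v, \qquad \tilde i(\delta) = E_{\theta + \delta v}(\tilde S(\delta, X)^\top \tilde S(\delta, X)) = v^\top i(\theta + \delta v)\, v. \]

In this subfamily, the condition (4.7) clearly implies (4.3) so the one-dimensional
version of the Cramér–Rao inequality yields

\[ V_{\theta + \delta v}(t_u(X)) \ge \frac{(u^\top Dg(\theta + \delta v)\, v)^2}{v^\top i(\theta + \delta v)\, v}. \]

Letting $\delta = 0$ and using that $V_\theta(t_u(X)) = u^\top V_\theta(t(X))\, u$ we get

\[ u^\top V_\theta(t(X))\, u \ge \frac{(u^\top Dg(\theta)\, v)^2}{v^\top i(\theta)\, v}. \tag{4.10} \]

This inequality holds for any choice of $v \in \mathbb{R}^k$; in particular we may choose

\[ v = i(\theta)^{-1} Dg(\theta)^\top u \]

and get

\[ u^\top Dg(\theta)\, v = u^\top Dg(\theta)\, i(\theta)^{-1} Dg(\theta)^\top u \]

and also

\[ v^\top i(\theta)\, v = u^\top Dg(\theta)\, i(\theta)^{-1} i(\theta)\, i(\theta)^{-1} Dg(\theta)^\top u = u^\top Dg(\theta)\, i(\theta)^{-1} Dg(\theta)^\top u. \]

Inserting the last two relations into (4.10) yields (4.9), as required.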


4.2 Various estimation methods 


In this section we briefly describe a few important and classical methods for con- 
structing good estimators. This book shall primarily be concerned with the method 
of maximum likelihood as this is almost universally applicable and based on a simple, 
yet ingenious principle. However, some of the methods described below may be used 
as a supplement or substitute when the method of maximum likelihood for some 
reason fails or leads to difficult computational problems. 
4.2.1 The method of least absolute deviations


The method of least absolute deviations is one of the earliest estimation methods;
it goes back at least to Galileo Galilei (1564-1642) and was also used extensively
by Pierre Simon Laplace (1749-1827); see for example Hald (1990, p. 49 ff) and
Hald (1998, p. 112 ff) for further details. For an observation $x \in \mathbb{R}^d$, an unknown
parameter $\theta$ is estimated as

\[ \hat\theta_{LD} = \operatorname{argmin}_{\theta \in \Theta} \|x - E_\theta(X)\|_1 = \operatorname{argmin}_{\theta \in \Theta} \sum_{i=1}^d |x_i - E_\theta(X_i)|. \]

A main reason why this method went out of fashion in the early 19th century
was its computational difficulty. There was typically no simple and
explicit solution of the minimization problem, even if the expectation $E_\theta(X)$ was
a linear function of $\theta$, and it was quickly superseded by the method of least squares,
where minimization could be performed by solving a system of linear equations; see
the next subsection.

However, it has had a renaissance in modern times, not least because of progress
in the theory and practice of convex optimization; if $E_\theta(X) = A\theta$ is a linear function,
the function to be minimized

\[ g(\theta, x) = \|x - A\theta\|_1 \]

is a convex function of $\theta$ and good methods now exist for minimizing such functions;
see e.g. Boyd and Vandenberghe (2004).


Example 4.11. Let us assume that $x = (x_1, \ldots, x_n)$ is a sample from a distribution
on $\mathbb{R}$ with $E_\theta(X_i) = \theta$. Now estimating $\theta$ by the method of least absolute deviations
leads to determining the estimate as

\[ \hat\theta_{LD} = \operatorname{argmin}_{\theta \in \Theta} \sum_{i=1}^n |x_i - \theta|. \]

Let $x_{(i)}$ denote the $i$th order statistic, i.e. the sample is ordered so that

\[ x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} \]

and note that then the objective function may be written as

\[ g_n(\theta) = \sum_{i=1}^n |x_i - \theta| = \sum_{i=1}^n |x_{(i)} - \theta|. \]

If $x_{(r)} < \theta < x_{(r+1)}$ we further get

\[ g_n(\theta) = \sum_{i=1}^{r} (\theta - x_{(i)}) + \sum_{i=r+1}^{n} (x_{(i)} - \theta) \]

so in this interval we have $g_n'(\theta) = 2r - n$. Thus the objective function is decreasing
in $\theta$ for $r < n/2$ and increasing in $\theta$ for $r > n/2$. It follows that if $n = 2p + 1$ is
odd, the function has a unique minimum at $\hat\theta_{LD} = x_{(p+1)}$, or if $n = 2p$ is even, any
value in the interval $[x_{(p)}, x_{(p+1)}]$ is a minimizer of $g_n(\theta)$. In other words

\[ \hat\theta_{LD} = \mathrm{med}(x_1, \ldots, x_n), \]

i.e. the LD-estimator is the sample median, defined as the interval $[x_{(p)}, x_{(p+1)}]$ if
$n = 2p$ is even.

An example of the LD function for a simulation of 21 exponentially distributed
observations with mean $\theta = 2$ is displayed in Fig. 4.3.
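
The conclusion of Example 4.11 can be checked by brute force. The sketch below is an illustration only and not the code behind Fig. 4.3; the function names, the seed and the grid resolution are assumptions. It evaluates the LD objective on a fine grid for a simulated exponential sample of odd size and compares the grid minimizer with the sample median.

```python
import random
import statistics


def ld_objective(theta, xs):
    """Sum of absolute deviations g_n(theta) = sum_i |x_i - theta|."""
    return sum(abs(x - theta) for x in xs)


def grid_minimizer(xs, lo, hi, steps):
    """Brute-force minimizer of the LD objective over an equispaced grid on [lo, hi]."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda theta: ld_objective(theta, xs))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so that the run is reproducible
    xs = [rng.expovariate(1 / 2.0) for _ in range(21)]  # exponential with mean 2
    print("median:", statistics.median(xs))
    print("grid minimizer:", grid_minimizer(xs, 0.0, max(xs), 10000))
```

Since the objective is convex and piecewise linear with its unique minimum at the median for odd n, the grid minimizer can differ from the median by at most one grid step.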


Generally, an advantage of the method of least absolute deviations is that it tends
to be relatively robust to data errors. For example, if one or two of the data points are
perturbed due to, say, recording errors, the median will only be slightly affected and
may not be affected at all. However, it may not always be a good estimator unless
the distribution of $X$ is roughly symmetric around $\theta$, as illustrated in Example 4.12
below.


Example 4.12. [LD estimator in exponential distribution] Consider estimation of
the mean $\theta$ of an exponential distribution based on $n = 2m + 1$ observations, as
in Example 4.8. In this example we calculated that the sample median (i.e. the LD
estimator of $\theta$) had mean and variance

\[ E_\theta(\hat\theta_{LD}) = \theta\, k(m) \approx \theta \log 2 = 0.693 \times \theta, \qquad V_\theta(\hat\theta_{LD}) = \theta^2\, v(m) \approx \frac{\theta^2}{n}, \]

yielding the MSE

\[ \mathrm{MSE}_\theta(\hat\theta_{LD}) \approx \theta^2 \left( \frac{1}{n} + (\log 2 - 1)^2 \right) \]

so the LD estimator is systematically far too small and the bias is dominating the
error as $n$ increases.

To make this estimator work reasonably, it is necessary to apply a bias correction
as in Example 4.8 to obtain $\tilde\theta_n = \hat\theta_{LD}/k(m)$ as we have seen. The phenomenon is
illustrated for 21 simulated observations in Fig. 4.3.


4.2.2 The method of least squares 


The method of least squares, or ordinary least squares (OLS), estimates the unknown
parameter by minimizing the sum of the squared deviations of the observations
from their expectation:

\[ \hat\theta_{OLS} = \operatorname{argmin}_{\theta \in \Theta} \|x - E_\theta(X)\|_2^2 = \operatorname{argmin}_{\theta \in \Theta} \sum_{i=1}^d (x_i - E_\theta(X_i))^2. \]


The method is usually attributed to Carl Friedrich Gauss (1777-1855) although it
was first published (in 1805) by Adrien Marie Legendre (1752-1833), see Hald
(1998, Ch. 19 and 21). However, there is no question that Gauss made a comprehensive
study of the method and its properties, including Theorem 4.13 below, and he
was a wizard in using it in every conceivable way.

Figure 4.3 – The LD function $g_n(\theta) = \sum_i |x_i - \theta|$ for a sample of n = 21 simulated
independent and exponentially distributed random variables with mean $\theta = 2$. The true
value is indicated by a solid line, and the estimate $\hat\theta_{LD}$ by a dashed line. The bias corrected
estimator $\tilde\theta_n$ is indicated by a dotted line.

One main advantage of OLS is that it has a simple computational solution in the
case when $\mu = E_\theta(X) = A\theta$ is a linear function of the parameter $\theta$, $\theta \in \Theta = \mathbb{R}^k$,
and $A$ is a $d \times k$ matrix. Then an alternative formulation of the minimization problem
is to find

\[ \hat\mu_{OLS} = \operatorname{argmin}_{\mu \in L} \|x - \mu\|_2^2 \]

where $L = \{y \in \mathbb{R}^d \mid \exists \theta \in \mathbb{R}^k : y = A\theta\}$ is the linear subspace of $\mathbb{R}^d$ determined as
the image of the map $\theta \mapsto A\theta$. Thus the minimizer is simply given as the orthogonal
projection onto $L$ of the observation $x$:

\[ \hat\mu_{OLS} = A\hat\theta_{OLS} = \Pi_L(x) \]

where $\Pi_L$ is the projection onto $L$. Further, this projection can be calculated by
solving the linear equation system

\[ A^\top A\, \theta = A^\top x \]

known as the normal equations and thus the computational issues are benign and
simple, and extensively studied.
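
As an illustration of the normal equations, the sketch below solves them for a design matrix with k = 2 columns using only the standard library. The function name and the small data set are assumptions made for the example; in practice a numerical linear algebra library would be used. The residual of the fit is orthogonal to the columns of $A$, which is exactly the projection property described above.

```python
def solve_normal_equations_2(A, x):
    """Solve A^T A theta = A^T x for a d x 2 design matrix A given as a list of rows."""
    # Entries of the symmetric 2 x 2 matrix A^T A and of the vector A^T x.
    s11 = sum(a[0] * a[0] for a in A)
    s12 = sum(a[0] * a[1] for a in A)
    s22 = sum(a[1] * a[1] for a in A)
    b1 = sum(a[0] * xi for a, xi in zip(A, x))
    b2 = sum(a[1] * xi for a, xi in zip(A, x))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("A^T A is singular: the columns of A are linearly dependent")
    # Cramer's rule for the 2 x 2 system.
    return ((s22 * b1 - s12 * b2) / det, (s11 * b2 - s12 * b1) / det)


if __name__ == "__main__":
    # Hypothetical data: intercept column and one covariate taking values 0, 1, 2, 3.
    A = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
    x = [1.0, 2.9, 5.1, 7.0]
    theta = solve_normal_equations_2(A, x)
    residual = [xi - (a[0] * theta[0] + a[1] * theta[1]) for a, xi in zip(A, x)]
    print(theta, residual)
```

Cramer's rule is used here only because the system is two-dimensional; for larger k a factorization of $A$ would be the usual choice.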
In addition, the OLS estimator has the following optimality property, known as the
Gauss–Markov theorem:


Theorem 4.13 (Gauss–Markov). Consider a model $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ on the representation
space $\mathbb{R}^d$ with finite second moments and covariance proportional to the
identity $V_\theta(X) = \sigma^2 I_d$. Consider the parameter function $\mu = \phi(\theta) = E_\theta(X)$ and
assume $\mu \in L$ where $L$ is a linear subspace of $\mathbb{R}^d$. Then the OLS estimator

\[ \hat\mu_{OLS} = \Pi_L(X) \]

is a best linear unbiased estimator (BLUE) of $\mu$ in the sense that for any other linear
unbiased estimator $\tilde\mu = t(X)$ it holds that

\[ V_\theta(t(X)) - V_\theta(\Pi_L(X)) \text{ is positive semidefinite.} \]


Proof. Let $t(X) = \Pi_L(X) + DX$. Since $t(X)$ is unbiased we have for all $\mu \in L$
that

\[ E_\theta(t(X)) = E_\theta(\Pi_L(X)) + E_\theta(DX) = \mu + D\mu = \mu \]

and hence $D\mu = 0$ for all $\mu \in L$, implying that $D\Pi_L = 0$ and thus also

\[ \Pi_L D^\top = (D\Pi_L)^\top = 0. \]

But then, using that $\Pi_L$ is idempotent and symmetric, we have

\[ V_\theta(t(X)) = \sigma^2 (\Pi_L + D)(\Pi_L + D)^\top = \sigma^2 (\Pi_L + \Pi_L D^\top + D\Pi_L + DD^\top) = \sigma^2 \Pi_L + \sigma^2 DD^\top = V_\theta(\hat\mu_{OLS}) + \sigma^2 DD^\top. \]

Since $\sigma^2 DD^\top$ is positive semidefinite, the proof is complete.


In the case where the covariance $V_\theta(X) = \Sigma$ is not proportional to the identity
we would use weighted least squares (WLS) and determine the estimate as

\[ \hat\mu_{WLS} = \operatorname{argmin}_{\mu \in L} \|x - \mu\|_W^2 = \operatorname{argmin}_{\mu \in L} (x - \mu)^\top W (x - \mu), \]

where $W = \Sigma^{-1}$ is the weight matrix. The normal equations for WLS are still linear:

\[ A^\top W A\, \theta = A^\top W x \]

with the solution $\hat\theta = (A^\top W A)^{-1} A^\top W x$ and hence

\[ A\hat\theta = A (A^\top W A)^{-1} A^\top W x = \Pi_L(x) \]

since this expression is indeed identified as the matrix for the projection onto $L$ with
respect to the inner product $\langle x, y \rangle_W = x^\top W y$; see also Proposition 2.21.

The conclusion in the Gauss–Markov theorem also holds in this generality,
as also shown by Gauss, and the proof is essentially identical to the proof above.
Note that in the case where $X \sim N_d(\mu, \Sigma)$ with $\mu = A\theta$, the WLS estimate is not
only BLUE, but in fact MVUE as the variance attains the Cramér–Rao lower bound.
Indeed, we have

\[ V_\theta(\hat\theta) = (A^\top W A)^{-1} A^\top W \Sigma W A (A^\top W A)^{-1} = (A^\top W A)^{-1} = i(\theta)^{-1}, \]

where the last equality is from (2.10).
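
A simple special case may help to fix ideas. Assume, purely for illustration, that $A$ is a single column of ones (a common mean $\theta$) and that $\Sigma$ is diagonal with known variances; the function names below are hypothetical. The WLS solution then reduces to an inverse-variance weighted average with variance $(A^\top W A)^{-1} = (\sum_i 1/\sigma_i^2)^{-1}$, which can be compared with the variance of the unweighted mean.

```python
def wls_common_mean(xs, variances):
    """WLS estimate of a common mean and its variance, for independent observations
    with known variances (A is a column of ones, W is diagonal with entries 1/variance)."""
    weights = [1.0 / s2 for s2 in variances]
    theta_hat = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
    var_theta_hat = 1.0 / sum(weights)  # (A^T W A)^{-1} in this special case
    return theta_hat, var_theta_hat


def ols_common_mean_variance(variances):
    """Variance of the unweighted mean under the same heteroscedastic model."""
    n = len(variances)
    return sum(variances) / n**2


if __name__ == "__main__":
    xs, variances = [1.0, 3.0], [1.0, 3.0]
    print(wls_common_mean(xs, variances), ols_common_mean_variance(variances))
```

In this toy case the weighted estimate has smaller variance than the plain average, as the Gauss–Markov conclusion for WLS suggests.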
4.2.3 M-estimators


A large class of estimators that includes the LD and OLS estimators is constructed in
the following way. For $\mathcal{X} = \mathbb{R}^k$ and $\Theta \subseteq \mathbb{R}^k$ we consider a function
$g : \mathcal{X} \times \Theta \to \mathbb{R}$. Then for a sample $(x_1, \ldots, x_n)$, we consider the estimator

$$\hat\theta_n = \operatorname*{argmin}_{\theta \in \Theta} \sum_{i=1}^n g(\theta, x_i).$$

Often it is assumed that the function $g$ is convex in $\theta$ and for every $x$ has a unique
minimum for $\theta = x$ or when $E_\theta(X) = x$. Examples of such functions that we have
seen already are $g(\theta, x) = |x - \theta|$ and $g(\theta, x) = (x - \theta)^2$. Indeed the LD and OLS
estimators are examples of estimators constructed in this way.

An estimator of this kind is known as an M-estimator. We shall not discuss these 
in any detail in these lecture notes but point out that many estimators used in modern 
machine learning are of this kind and under suitable smoothness assumptions they 
tend to have benign properties for large n. 
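As a minimal sketch of the construction above for a one-dimensional parameter, the following code minimizes $\sum_i g(\theta, x_i)$ numerically. The golden-section search is a choice made here for illustration (it relies on the objective being convex), and the function name `m_estimate` is hypothetical.

```python
import math
import statistics


def m_estimate(xs, g, tol=1e-9):
    """Minimize theta -> sum_i g(theta, x_i) by golden-section search.

    Assumes the objective is convex in theta, so that it is unimodal and
    its minimizer lies in [min(xs), max(xs)] for the losses used below."""
    def objective(theta):
        return sum(g(theta, x) for x in xs)

    lo, hi = min(xs), max(xs)
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if objective(a) < objective(b):
            hi = b      # the minimizer lies in [lo, b]
        else:
            lo = a      # the minimizer lies in [a, hi]
    return 0.5 * (lo + hi)


xs = [2.1, 0.3, 3.7, 1.9, 2.8, 10.0, 1.2]
ols = m_estimate(xs, lambda theta, x: (x - theta) ** 2)   # squared loss
ld = m_estimate(xs, lambda theta, x: abs(x - theta))      # absolute loss
print(ols, statistics.mean(xs))
print(ld, statistics.median(xs))
```

With squared loss the minimizer agrees with the sample mean, and with absolute loss it agrees with the sample median (unique here because the sample size is odd).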


4.2.4 The method of moments 


In this section we shall discuss estimators that are constructed by matching the
empirical averages of specific functions to their expectations, or moments. We consider
a sample $X_1, \ldots, X_n$ from a parametrized family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ on the
representation space $(\mathcal{X}, \mathbb{E})$. In addition we consider a statistic $t : \mathcal{X} \to \mathbb{R}^k$ with finite
expectation for all $\theta \in \Theta$.


Definition 4.14. The moment function $m : \Theta \to \mathbb{R}^k$ of the statistic $t$ is the function
$m(\theta) = E_\theta(t(X))$.


Typical choices for $\mathcal{X} = \mathbb{R}$ would be $t(x)^\top = (x, x^2)$ corresponding to the first
two moments of the distribution $P_\theta$, but many other choices of $t$ are possible. If the
moment function is injective, we define the moment estimator as follows.


Definition 4.15. Let $X_1, \ldots, X_n$ be a sample from a parametrized family
$\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ and $t : \mathcal{X} \to \mathbb{R}^k$ a statistic with an injective moment function
$m : \Theta \to \mathbb{R}^k$ and let

$$T_n = \frac{1}{n}\sum_{i=1}^n t(X_i).$$

The moment estimator $\hat\theta_{\mathrm{mom}}$ of $\theta$ based on $m$ is well-defined if $T_n \in m(\Theta)$ and is
then obtained by equating the empirical and theoretical moments:

$$m(\hat\theta_{\mathrm{mom}}) = T_n.$$

In other words, we have $\hat\theta_{\mathrm{mom}} = m^{-1}(T_n)$.
The method of moments was advocated by Karl Pearson (1857-1936), who used
the method with the statistic $t(x) = (x, x^2, x^3, x^4)^\top$, i.e. based on the first
four moments of the distribution; in addition he introduced a 'system of curves', i.e.
a four parameter family of distributions that could then be fitted with the method of
moments; see Hald (1998, p. 721). It was later established that this method is
not always good, as the empirical moments of high order tend to be very unstable
and have a huge variability.


Example 4.16. [Moment estimators in the normal model] Consider the simple normal
model determined by $X_1, \ldots, X_n$ being independent and normally distributed as
$N(\xi, 1)$, where $\xi \in \mathbb{R}$ is unknown. There are a variety of possible moment estimators
of $\xi$, depending on what we choose as moment function. We could, for example,
consider

$$t_1(x) = x, \quad t_2(x) = 1_{(0,\infty)}(x), \quad t_3(x) = x^3$$

with corresponding moment functions

$$m_1(\xi) = \xi, \quad m_2(\xi) = P_\xi(X > 0) = \Phi(\xi), \quad m_3(\xi) = 3\xi + \xi^3$$

since $X \overset{D}{=} Z + \xi$ where $Z$ has a standard normal distribution and thus

$$E_\xi(X^3) = E((Z + \xi)^3) = E(Z^3) + 3\xi E(Z^2) + 3\xi^2 E(Z) + \xi^3 = 3\xi + \xi^3.$$

These moment functions are all injective; the image of $m_1$ and $m_3$ is all of $\mathbb{R}$, so
the moment estimators $\hat\xi_1$ and $\hat\xi_3$ are always well-defined; whereas $m_2$ maps $\mathbb{R}$
injectively to $(0, 1)$ so $\hat\xi_2$ is well-defined if and only if the observed sample $x_1, \ldots, x_n$
contains both positive and negative values, which of course will happen with high
probability if $n$ is large. If we let

$$A_n = \frac{1}{n}\sum_{i=1}^n X_i, \quad B_n = \frac{1}{n}\sum_{i=1}^n 1_{(0,\infty)}(X_i), \quad C_n = \frac{1}{n}\sum_{i=1}^n X_i^3,$$

the corresponding moment estimators are

$$\hat\xi_{1n} = A_n, \quad \hat\xi_{2n} = \Phi^{-1}(B_n), \quad \hat\xi_{3n} = g(C_n).$$

Here $\Phi$ is the standard normal distribution function and

$$g(y) = 2\sinh\left(\tfrac{1}{3}\sinh^{-1}\left(\tfrac{y}{2}\right)\right)$$

is the unique real root of the third-degree equation

$$x^3 + 3x - y = 0,$$

expressed in terms of the hyperbolic sine function

$$\sinh(x) = \frac{e^x - e^{-x}}{2}.$$
Figure 4.4: Simulation of 5000 estimates of $\xi$ with the three moment estimators in
Example 4.16. The standard average to the left is clearly the preferable estimator and the
estimator in the middle is the worst.


We shall later, in Section 5.2, return to this example and discuss the properties of
these estimators for large $n$. At this point we restrict ourselves to showing the results
of a simulation where we have estimated $\xi$ 5000 times using these three methods for
a sample size of $n = 100$. The outcome of the simulation is displayed in Fig. 4.4; the
plots indicate that $\hat\xi_1$ is the best, whereas $\hat\xi_2$ is the worst of these estimators.
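A simulation of this kind can be sketched as follows. This is not the code behind Fig. 4.4: the seed, the number of repetitions, and the function names are assumptions, and samples in which all observations have the same sign are simply skipped for the second estimator, since it is not defined there.

```python
import math
import random
from statistics import NormalDist, pvariance


def moment_estimates(xs):
    """Return the three moment estimates (xi1, xi2, xi3) of Example 4.16.
    xi2 is None when B_n is 0 or 1, i.e. when it is not well-defined."""
    n = len(xs)
    a_n = sum(xs) / n                        # average of t1(x) = x
    b_n = sum(1 for x in xs if x > 0) / n    # average of t2(x) = 1{x > 0}
    c_n = sum(x ** 3 for x in xs) / n        # average of t3(x) = x^3
    xi1 = a_n
    xi2 = NormalDist().inv_cdf(b_n) if 0.0 < b_n < 1.0 else None
    xi3 = 2.0 * math.sinh(math.asinh(c_n / 2.0) / 3.0)   # real root of x^3 + 3x = c_n
    return xi1, xi2, xi3


def simulate(xi=2.0, n=100, reps=2000, seed=1):
    """Empirical variances of the three estimators over simulated samples."""
    rng = random.Random(seed)
    draws = [moment_estimates([rng.gauss(xi, 1.0) for _ in range(n)])
             for _ in range(reps)]
    variances = []
    for j in range(3):
        values = [d[j] for d in draws if d[j] is not None]
        variances.append(pvariance(values))
    return variances


if __name__ == "__main__":
    print(simulate())
```

Multiplying the three empirical variances by $n$ gives rough estimates of how the estimators compare in precision.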


Moment estimators tend to be sensible, are often easy to construct and compute, 
and as such they can be quite useful; however, they may also be very inefficient. The 
following is an example of a moment estimator that is quite often used in practice. 


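Example 4.17. Consider the family of gamma distributions, with densities with
respect to Lebesgue measure

$$f(x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}, \quad x > 0,$$

where $\alpha > 0$ and $\beta > 0$. For this family we have the first and second moments

$$E_{\alpha,\beta}(X) = \alpha\beta, \quad E_{\alpha,\beta}(X^2) = \alpha(\alpha + 1)\beta^2$$

and thus $m(\alpha, \beta)^\top = (\alpha\beta, \alpha(\alpha + 1)\beta^2)$. Thus the moment estimation equation is

$$\bar X = \alpha\beta, \quad \frac{1}{n}\sum_{i=1}^n X_i^2 = \alpha(\alpha + 1)\beta^2$$

with the solution

$$\hat\alpha_{\mathrm{mom}} = \bar X^2/\tilde S^2, \quad \hat\beta_{\mathrm{mom}} = \tilde S^2/\bar X$$

where we have let

$$\tilde S^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2.$$

Note that the sums of squares of deviations in the last line are divided by $n$ and not
$n - 1$.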
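The moment estimator of Example 4.17 is simple to compute. The sketch below uses a hypothetical function name and, as an illustrative assumption, a simulated sample with shape 3 and scale 2.

```python
import random


def gamma_moment_estimate(xs):
    """Moment estimates (alpha, beta) in the gamma model from t(x) = (x, x^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / n    # divisor n, not n - 1
    return xbar ** 2 / s2, s2 / xbar


rng = random.Random(0)
# random.gammavariate(alpha, beta) uses beta as a scale parameter, as in the text.
sample = [rng.gammavariate(3.0, 2.0) for _ in range(10_000)]
print(gamma_moment_estimate(sample))
```

By construction the estimates reproduce the first two empirical moments exactly: $\hat\alpha\hat\beta$ equals the sample mean and $\hat\alpha(\hat\alpha + 1)\hat\beta^2$ equals the average of the squares.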


An issue with the method of moments is the choice of statistic t which has to be 
made ad hoc, and there seems to be no guiding principle for doing this. Nevertheless, 
the method can be quite handy in many cases. In Section 4.3.2 we shall see that for 
exponential families there is a canonical choice. 


4.3 The method of maximum likelihood

4.3.1 General considerations


The estimation methods mentioned above have all been ad hoc and have not
exploited properties of the specific model, although Example 4.12 clearly shows that
estimation methods may be very bad if applied to models that do not suit them.

In contrast, the method of maximum likelihood as invented by R. A. Fisher yields
a universal estimation method for any dominated statistical model, one which is
specifically tailored to that model.

More precisely, for any parametrized and dominated statistical model $\mathcal{P}$ on a
representation space $(\mathcal{X}, \mathbb{E})$ and associated family of densities $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$,
we define the maximum-likelihood estimator (MLE) as

$$\hat\theta_{\mathrm{ML}} = \operatorname*{argmax}_{\theta \in \Theta} \ell_x(\theta)$$

provided this is well-defined. So if we have a sample $(X_1, \ldots, X_n)$ from $\mathcal{P}$, the MLE
is defined as

$$\hat\theta_{\mathrm{ML}} = \operatorname*{argmax}_{\theta \in \Theta} \sum_{i=1}^n \ell_{X_i}(\theta)$$

so the MLE is an M-estimator based on the negative of the log-likelihood function
$g(\theta, x) = -\ell_x(\theta)$. Indeed, the LD, OLS, or WLS estimators are all
maximum-likelihood estimators corresponding to specific models as we shall see.


Example 4.18. [Laplace model] Consider a sample $x = (x_1, \ldots, x_n)$ from the
Laplace distribution with density

$$f_\theta(x) = \tfrac{1}{2} e^{-|x - \theta|}$$

with respect to standard Lebesgue measure on $\mathbb{R}$. Here the log-likelihood function
becomes, up to an additive constant,

$$\ell_x(\theta) = \sum_{i=1}^n \ell_{x_i}(\theta) = -\sum_{i=1}^n |x_i - \theta|$$

and this is clearly maximized in $\theta$ if and only if the sum $\sum_i |x_i - \theta|$ is minimized.
In other words, from Example 4.11 we conclude that the MLE is the median

$$\hat\theta_{\mathrm{ML}} = \hat\theta_{\mathrm{LD}} = \operatorname{med}(x_1, \ldots, x_n).$$

Thus within this model, the method of least absolute deviations coincides with the
method of maximum likelihood.


If the family is smooth, the maximizer of $\ell$ must be a stationary point of the
log-likelihood function, i.e. it satisfies the equation

$$D\ell_x(\hat\theta_{\mathrm{ML}}) = 0$$

or equivalently

$$S(x, \hat\theta_{\mathrm{ML}}) = 0 \tag{4.11}$$

and this equation is known as the likelihood equation or score equation. For a sample
$(x_1, \ldots, x_n)$, the score equation takes the form

$$\sum_{i=1}^n S(x_i, \hat\theta_{\mathrm{ML}}) = 0.$$

A solution of the score equation is not necessarily the MLE, as a stationary point of
the log-likelihood function need be neither a local nor a global maximum. The observed
information

$$I(x, \hat\theta_{\mathrm{ML}}) = -DS(x, \hat\theta_{\mathrm{ML}}) = -D^2\ell_x(\hat\theta_{\mathrm{ML}})$$

is positive definite at the solution if and only if the solution corresponds to a local
maximum and if this holds and the solution is unique, it must be a global maximum.
But generally an additional argument is needed to establish that a solution of the
score equation is also the MLE.

In the following we shall omit the subscript of the estimator so that $\hat\theta$ always
refers to the MLE unless specified otherwise. We hasten to provide some examples
of maximum-likelihood estimation.


Example 4.19. [MLE in Poisson model] Consider the simple Poisson model with
unknown mean $\lambda \in \Lambda = (0, \infty)$ and density

$$f_\lambda(x) = \frac{\lambda^x}{x!} e^{-\lambda}, \quad x \in \mathbb{N}_0.$$

The associated log-likelihood function and score function are

$$\ell_x(\lambda) = x\log\lambda - \lambda, \quad S(x, \lambda) = \frac{x}{\lambda} - 1$$

so the score equation has a unique solution $\hat\lambda = x$ if and only if $x > 0$. For $x = 0$
there is no solution within the parameter space. The information function is

$$I(x, \lambda) = \frac{x}{\lambda^2}$$

which is strictly positive, ensuring that the solution corresponds to a local maximum
of the likelihood function and since there is only one solution, it is a global maximum.
Thus the MLE is $\hat\lambda = x$.


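Example 4.20. [MLE in exponential distribution] Consider a sample $(x_1, \ldots, x_n)$
from an exponential distribution with mean $\theta \in \Theta = (0, \infty)$, i.e. with density

$$f_\theta(x) = \frac{1}{\theta} e^{-x/\theta}, \quad x > 0.$$

The log-likelihood and score functions become

$$\ell_x(\theta) = -n\log\theta - \sum_{i=1}^n \frac{x_i}{\theta}, \quad S(x, \theta) = -\frac{n}{\theta} + \sum_{i=1}^n \frac{x_i}{\theta^2}$$

and the score equation has a unique solution for

$$\hat\theta_n = \frac{1}{n}\sum_{i=1}^n x_i = \bar x_n.$$

The observed information is

$$I(x, \hat\theta_n) = -\frac{n}{\hat\theta_n^2} + \frac{2\sum_i x_i}{\hat\theta_n^3} = \frac{n}{\hat\theta_n^2} > 0$$

so the unique solution is indeed the MLE. Note that this estimator is also the MVUE
as considered in Example 4.8.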
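The steps of Example 4.20 can be checked numerically. The data and function names below are made up for illustration; the formulas are those of the example for an exponential distribution with mean $\theta$.

```python
import math


def exp_loglik(theta, xs):
    """Log-likelihood of an exponential sample with mean theta."""
    return -len(xs) * math.log(theta) - sum(xs) / theta


def exp_score(theta, xs):
    """Derivative of the log-likelihood with respect to theta."""
    return -len(xs) / theta + sum(xs) / theta ** 2


def exp_observed_information(theta, xs):
    """Minus the second derivative of the log-likelihood."""
    return -len(xs) / theta ** 2 + 2.0 * sum(xs) / theta ** 3


xs = [0.8, 2.5, 1.1, 4.0, 0.6]
theta_hat = sum(xs) / len(xs)                      # solution of the score equation
print(exp_score(theta_hat, xs))                    # numerically zero
print(exp_observed_information(theta_hat, xs))     # equals n / theta_hat^2 > 0
```

At the sample mean the score vanishes and the observed information is positive, so the stationary point is a maximum.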


Example 4.21. [MLE in uniform model] We calculated the likelihood function in
the uniform model in Example 1.18 and from the display in Fig. 1.2 we conclude
that the MLE of $\theta$ based on a sample $(X_1, \ldots, X_n)$ is

$$\hat\theta_n = \max(X_1, \ldots, X_n).$$

Since this model is not smooth, we cannot obtain the estimate from the score
equation. Note that even though the MLE here is quite reasonable, we may prefer the
bias-corrected version

$$\tilde\theta_n = \frac{n+1}{n}\hat\theta_n = \frac{n+1}{n}\max(X_1, \ldots, X_n)$$

as discussed further in Example 4.2.
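The effect of the bias correction can be illustrated by simulation. The sketch below assumes that the uniform model of Example 1.18 is the uniform distribution on $(0, \theta)$; that assumption, the sample size, and the function names are not taken from the text.

```python
import random


def uniform_mle(xs):
    """MLE of theta in the uniform model on (0, theta) (assumed form)."""
    return max(xs)


def uniform_bias_corrected(xs):
    """Bias-corrected version (n + 1) / n * max(xs)."""
    n = len(xs)
    return (n + 1) / n * max(xs)


def average_estimates(theta=1.0, n=10, reps=20000, seed=0):
    """Monte Carlo averages of the MLE and of its bias-corrected version."""
    rng = random.Random(seed)
    total_mle = total_corrected = 0.0
    for _ in range(reps):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        total_mle += uniform_mle(xs)
        total_corrected += uniform_bias_corrected(xs)
    return total_mle / reps, total_corrected / reps


if __name__ == "__main__":
    print(average_estimates())
```

Under this assumption the MLE has expectation $n\theta/(n+1)$, so its average falls below $\theta$, while the average of the corrected estimator is close to $\theta$.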


We conclude this section with the well-known estimation problem in the simple 
normal model. 
Example 4.22. [MLE in the simple normal model] We consider a sample from the
simple normal model where $X_1, \ldots, X_n$ are independent and identically normally
distributed as $N(\xi, \sigma^2)$, where $\theta = (\xi, \sigma^2) \in \Theta = \mathbb{R} \times \mathbb{R}_+$ is unknown. Up to an
additive constant, the log-likelihood function becomes

$$\ell_x(\xi, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \xi)^2.$$

For fixed $\sigma^2$, this is maximized in $\xi$ for $\hat\xi_n = \bar x_n$, and thus

$$\ell_x(\hat\xi_n, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2}\sum_{i=1}^n \frac{(x_i - \bar x_n)^2}{\sigma^2}
 = -\frac{n}{2}\log\sigma^2 - \frac{(n-1)s^2}{2\sigma^2}.$$

Differentiation w.r.t. $\sigma^2$ shows that this function is maximized by

$$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x_n)^2 = \frac{n-1}{n}s^2.$$

We note that the MLE divides the sum of squares of deviations from the mean by $n$
rather than $n - 1$. It is customary to use the bias-corrected MLE $\tilde\sigma^2 = s^2$.
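For concreteness, the two variance estimates can be computed side by side. This is a small sketch with a hypothetical function name and made-up data.

```python
def normal_mle(xs):
    """Return (xi_hat, sigma2_mle, s2) for a normal sample:
    the MLE of the mean, the MLE of the variance (divisor n),
    and the bias-corrected variance estimate (divisor n - 1)."""
    n = len(xs)
    xi_hat = sum(xs) / n
    ss = sum((x - xi_hat) ** 2 for x in xs)    # sum of squared deviations
    return xi_hat, ss / n, ss / (n - 1)


xi_hat, sigma2_mle, s2 = normal_mle([1.0, 2.0, 3.0, 4.0])
print(xi_hat, sigma2_mle, s2)
```

The two variance estimates differ by the factor $(n-1)/n$, which matters for small samples and becomes negligible as $n$ grows.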


This example is a special instance of a linear normal model, where the
details and likelihood analysis are quite similar.


Example 4.23. [Linear normal model] Consider the linear normal model as in
Example 3.6, determined by the family of densities

$$f_{(\xi,\sigma^2)}(x) = \frac{1}{(2\pi\sigma^2)^{d/2}} e^{-\|x - \xi\|^2/(2\sigma^2)}$$

with respect to standard Lebesgue measure $\lambda_V$ on a $d$-dimensional Euclidean space
$V$, where $(\xi, \sigma^2) \in L \times \mathbb{R}_+$ for $L$ being an $m$-dimensional linear subspace of $V$.
Ignoring irrelevant additive constants, the log-likelihood becomes

$$\ell(\xi, \sigma^2) = -\frac{d}{2}\log\sigma^2 - \frac{\|x - \xi\|^2}{2\sigma^2};$$

for fixed $\sigma^2$, this is maximized in $\xi$ by minimizing $\|x - \xi\|^2$ over $L$. The minimizer
is the projection $\Pi_L(x)$ onto $L$ and hence $\hat\xi = \Pi_L(x)$. Inserting this into $\ell$ yields the
profile likelihood for $\sigma^2$

$$\ell(\hat\xi, \sigma^2) = -\frac{d}{2}\log\sigma^2 - \frac{\|x - \Pi_L(x)\|^2}{2\sigma^2},$$

which, as in the previous example, is maximized by

$$\hat\sigma^2 = \frac{1}{d}\|x - \Pi_L(x)\|^2.$$

From Theorem 2.24(d) we have that $\|X - \Pi_L(X)\|^2/\sigma^2 \sim \chi^2(d - m)$, where
$m = \dim L$. This implies that $\hat\sigma^2$ may be heavily biased if $m$ is relatively large
compared to $d$ since

$$E_{(\xi,\sigma^2)}(\hat\sigma^2) = \frac{d - m}{d}\sigma^2$$

and it is therefore common to use the bias-corrected version

$$\tilde\sigma^2 = \frac{1}{d - m}\|x - \Pi_L(x)\|^2$$

for estimating $\sigma^2$. But beware that $\tilde\sigma$ is not unbiased for $\sigma$ nor is $\tilde\sigma^{-2}$ unbiased for
$\sigma^{-2}$; see Exercise 4.2.


One important property of the method of maximum likelihood is that it is 
equivariant under reparametrizations of the model, in contrast to most other estima- 
tion methods. We formulate this important result as a theorem. 


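Theorem 4.24 (Equivariance of the MLE). Consider a parametrized statistical
model $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ with associated family of densities $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$
and a bijective reparametrization $\phi : \Theta \to \Lambda$ so an alternative representation of the
family is $\mathcal{F} = \{g_\lambda \mid \lambda \in \Lambda\}$. Then the MLE $\hat\theta$ of $\theta$ based on an observation $X = x$ is
well-defined if and only if the MLE $\hat\lambda$ of $\lambda$ is well-defined and then $\hat\lambda = \phi(\hat\theta)$.


Proof. This is a simple consequence of the equivariance of the likelihood function
itself since if $\lambda = \phi(\theta)$ we have $g_\lambda = f_\theta$, so the log-likelihood $\tilde\ell_x$ in the
$\lambda$-parametrization satisfies

$$\tilde\ell_x(\lambda) = \tilde\ell_x(\phi(\theta)) = \ell_x(\theta)$$

and thus, with $\hat\lambda = \phi(\hat\theta)$,

$$\ell_x(\hat\theta) \geq \ell_x(\theta),\ \theta \in \Theta \iff \tilde\ell_x(\hat\lambda) \geq \tilde\ell_x(\lambda),\ \lambda \in \Lambda.$$

This completes the proof.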
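Equivariance can be illustrated numerically by maximizing the same likelihood in two parametrizations. The sketch below uses an exponential sample, parametrized either by its mean $\theta$ or by its rate $\lambda = 1/\theta$; the data, the search interval, and the use of golden-section search are assumptions made for illustration.

```python
import math


def argmax_golden(f, lo, hi, tol=1e-10):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        if f(a) > f(b):
            hi = b      # the maximizer lies in [lo, b]
        else:
            lo = a      # the maximizer lies in [a, hi]
    return 0.5 * (lo + hi)


xs = [0.8, 2.5, 1.1, 4.0, 0.6]
n, total = len(xs), sum(xs)


def loglik_mean(theta):
    """Log-likelihood in the mean parametrization."""
    return -n * math.log(theta) - total / theta


def loglik_rate(lam):
    """Log-likelihood in the rate parametrization lam = 1 / theta."""
    return n * math.log(lam) - lam * total


theta_hat = argmax_golden(loglik_mean, 0.01, 100.0)
lam_hat = argmax_golden(loglik_rate, 0.01, 100.0)
print(theta_hat, lam_hat, 1.0 / theta_hat)
```

The two maximizations are carried out independently, yet the maximizer in the rate parametrization agrees with $1/\hat\theta$ up to numerical error, as the theorem predicts.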


4.3.2 Maximum likelihood in regular exponential families


We have seen that the MLE is always an M-estimator and in specific cases corres- 
ponds to methods of least deviations or least squares. Here we shall establish that for 
regular exponential families, the MLE is also a moment estimator. 


Theorem 4.25. Let $(X_1, \ldots, X_n)$ be a sample from a minimal and regular
exponential family with canonical parameter $\theta \in \Theta$ and canonical statistic $t$. Then the MLE
$\hat\theta_n$ based on this sample is a moment estimator with respect to the canonical statistic
$t$ and moment function

$$m(\theta) = \tau(\theta) = \nabla\psi(\theta) = E_\theta(t(X)).$$

If the moment equation has a solution, this solution is unique.
Proof. Let $x = (x_1, \ldots, x_n)$ and $\bar t_n = (t(x_1) + \cdots + t(x_n))/n$. The log-likelihood
function is determined as

$$\bar\ell_x(\theta) = \frac{1}{n}\log L(\theta; x) = \theta^\top \bar t_n - \psi(\theta).$$

Differentiation yields the score equation

$$\tau(\theta) = \bar t_n$$

where $\tau(\theta) = \nabla\psi(\theta) = E_\theta(t(X))$, so the score equation is exactly the moment
equation for the moment estimator based on the canonical statistic $t$.

Since $\tau$ is injective by Theorem 3.17, there is at most one point satisfying the
score equation. Differentiating a second time yields

$$D^2\bar\ell_x(\theta) = -\kappa(\theta) = -V_\theta(t(X)) = -\Sigma_\theta.$$

Since the family is minimal, $\Sigma_\theta$ is positive definite; so if the score equation has a
solution, it is a maximizer and hence the moment estimator is also the MLE.


Example 4.26. [The gamma family] Consider again the family of gamma
distributions in Example 4.17.

This can be represented as a minimal and regular exponential family of dimension
2, with canonical parameters

$$\theta = (\theta_1, \theta_2)^\top = (\alpha, 1/\beta)^\top,$$

canonical parameter space $\Theta = (0, \infty)^2$, canonical statistic $t(x) = (\log x, -x)^\top$,
cumulant function

$$\psi(\theta) = \log c(\theta) = \log\Gamma(\theta_1) - \theta_1\log(\theta_2),$$

and base measure $\mu(dx) = x^{-1}\,dx$ on $(0, \infty)$. The exponential representation of the
density becomes

$$f(x; \theta) = \frac{e^{\theta_1\log x + \theta_2(-x)}}{(\theta_2)^{-\theta_1}\Gamma(\theta_1)}.$$

For the mean we get by differentiation of the cumulant function

$$\tau(\theta)^\top = E_\theta((\log X, -X)) = (\Psi(\theta_1) - \log(\theta_2), -\theta_1/\theta_2) = (\Psi(\alpha) + \log(\beta), -\alpha\beta)$$

where $\Psi(\alpha) = D\log\Gamma(\alpha)$ is known as the digamma function.
The likelihood equations or moment equations for these statistics become

$$\Psi(\alpha) + \log\beta = \overline{\log x}, \quad \alpha\beta = \bar x,$$

or, equivalently,

$$\log\alpha - \Psi(\alpha) = \log\bar x - \overline{\log x}; \quad \beta = \bar x/\alpha. \tag{4.12}$$

The function $\alpha \mapsto \log\alpha - \Psi(\alpha)$ is strictly decreasing in $\alpha$ since it has derivative

$$\frac{1}{\alpha} - \Psi'(\alpha) = \frac{1}{\alpha} - \sum_{i=0}^\infty \frac{1}{(\alpha + i)^2} < 0.$$

It can also be shown that for all $\alpha > 0$ it holds that

$$\frac{1}{2\alpha} < \log\alpha - \Psi(\alpha) < \frac{1}{\alpha}$$

and thus the first of the equations in (4.12) has a unique solution $\hat\alpha$ with $\hat\alpha \in (0, \infty)$
if and only if $\overline{\log x} < \log\bar x$, which holds if and only if at least two of $x_1, \ldots, x_n$ are
different. Then $\hat\beta$ is determined from $\hat\alpha$ and $\bar x$ via the second equation in (4.12).

We note that although the MLE is a moment estimator it uses a different moment 
function than in Example 4.17. 
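Equations (4.12) have no closed-form solution, but they are easy to solve numerically. The sketch below is one possible approach, not the book's: the digamma function is approximated by its recurrence together with a standard asymptotic expansion, and $\hat\alpha$ is found by bisection, using the bounds $1/(2\alpha) < \log\alpha - \Psi(\alpha) < 1/\alpha$ to bracket the root. Function names are hypothetical.

```python
import math


def digamma(x):
    """Approximate digamma via psi(x) = psi(x + 1) - 1/x and an
    asymptotic expansion applied once the argument is at least 6."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # log x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def gamma_mle(xs):
    """Solve (4.12) for (alpha_hat, beta_hat) by bisection."""
    n = len(xs)
    xbar = sum(xs) / n
    c = math.log(xbar) - sum(math.log(x) for x in xs) / n
    if c <= 0.0:
        raise ValueError("MLE not defined: all observations are equal")
    lo, hi = 0.5 / c, 1.0 / c            # bracket implied by the two bounds
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > c:
            lo = mid                     # decreasing function: root is to the right
        else:
            hi = mid
    alpha_hat = 0.5 * (lo + hi)
    return alpha_hat, xbar / alpha_hat


print(gamma_mle([1.0, 2.0, 3.0, 6.0]))
```

Bisection is slow compared with Newton's method but needs no derivative of the digamma function and cannot leave the bracket.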


4.4 Exercises 


Exercise 4.1. Let $X$ and $Y$ be independent and exponentially distributed with
$E(X) = \lambda$ and $E(Y) = 3\lambda$. Consider the following two estimators of $\lambda$:

$$\hat\lambda = (3X + Y)/6, \quad \tilde\lambda = \sqrt{XY/3}.$$

In the following you may without proof use that $\Gamma(1.5) = \Gamma(0.5)/2 = \sqrt{\pi}/2$, where

$$\Gamma(y) = \int_0^\infty u^{y-1} e^{-u}\,du$$

is the gamma function.
a) Which of these estimators are unbiased estimators of $\lambda$?
b) Determine the variance for both of these estimators;
c) Compare the variances to the Cramér–Rao lower bound and comment on the
result;
d) Which of the estimators has the smallest mean square error?
e) Confirm the findings above via a simulation experiment.
Exercise 4.2. The standard estimator

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2$$

in the normal distribution is distributed as $S_n^2 \sim \sigma^2\chi^2(n-1)/(n-1)$ and thus
unbiased for the variance $\sigma^2$, but not for the precision $\rho = 1/\sigma^2$, nor for the standard
deviation $\sigma$, as discussed in Example 4.3.

a) Find and compare the mean square errors of $S_n$ and the MLE
$\hat\sigma_n = S_n\sqrt{(n-1)/n}$ as estimates of the standard deviation $\sigma$.
b) Determine a constant $a_n$ such that

$$\hat\rho_n = \frac{a_n}{S_n^2}$$

is an unbiased estimator of the precision $\rho$ for $n > 3$.
Exercise 4.3. Let $X_1, \ldots, X_n$ be independent with means $E(X_i) = \mu + \beta_i$ and
variances $V(X_i) = \sigma_i^2$. Such a situation could, for example, occur when $X_i$ are
estimators of $\mu$ obtained from independent sources and $\beta_i$ is the bias of the estimator
$X_i$.

We now consider pooling the estimators of $\mu$ into a common estimator by using
a linear combination:

$$\hat\mu = w_1 X_1 + w_2 X_2 + \cdots + w_n X_n.$$

a) If the estimators are unbiased, i.e. if $\beta_i = 0$ for all $i$, show that a linear
combination $\hat\mu$ as above is unbiased if and only if $\sum_i w_i = 1$;
b) In the case when $\beta_i = 0$ for all $i$, show that an unbiased linear combination has
minimum variance when the weights $w_i$ are inversely proportional to the
variances $\sigma_i^2$;
c) Show that the variance of $\hat\mu$ for optimal weights $w_i$ is $V(\hat\mu) = 1/\sum_i \sigma_i^{-2}$;
d) Next, consider the case where the estimators may be biased so we could have
$\beta_i \neq 0$. Find the mean square error of the optimal linear combination obtained
above, and compare its behaviour as $n \to \infty$ in the biased and unbiased case,
when $\sigma_i^2 = \sigma^2$, $i = 1, \ldots, n$.


Exercise 4.4. Let $X = (X_1, \ldots, X_n)$ be a sample of size $n$ from the uniform
distribution on the interval $(\mu - \delta, \mu + \delta)$ with density

$$f_{\mu,\delta}(x) = \frac{1}{2\delta} 1_{(\mu-\delta,\,\mu+\delta)}(x)$$

with respect to standard Lebesgue measure on $\mathbb{R}$, where $\theta = (\delta, \mu) \in \Theta = \mathbb{R}_+ \times \mathbb{R}$
with $\delta$ and $\mu$ both unknown.

a) Determine the moment estimator of $\theta = (\delta, \mu)$ based on $t(x) = (x, x^2)$ and
denote the estimator by $\hat\theta = (\hat\mu, \hat\delta)$;
b) Consider also the estimator $\tilde\theta$ given by

$$\tilde\mu = (X_{(1)} + X_{(n)})/2, \quad \tilde\delta = (X_{(n)} - X_{(1)})/2;$$

and show that $\tilde\mu$ is an unbiased estimator of $\mu$;
c) Consider also the estimator

$$\check\mu = \operatorname{med}(X_1, \ldots, X_n)$$

and show that also this is an unbiased estimator of $\mu$. Hint: Use that $X_1, \ldots, X_n$
has the same distribution as $\mu + \delta U_1, \ldots, \mu + \delta U_n$ where $U_1, \ldots, U_n$ are independent and
identically uniformly distributed on $(-1, 1)$;
d) Compare the variances of $\hat\mu$, $\tilde\mu$, and $\check\mu$ by simulation;
e) Compare the mean square errors of $\hat\delta$ and $\tilde\delta$ by simulation.
Exercise 4.5. Let $X = (X_1, \ldots, X_n)$ be independent and Poisson distributed with

$$E(X_i) = \lambda N_i, \quad i = 1, \ldots, n,$$

where $\lambda > 0$ is unknown and $N_i$ are known constants. Models of this type arise, for
example, in risk studies where $N_i$ is the number of individuals at risk in group $i$, $X_i$
the number of events, e.g. accidents or casualties, and $\lambda$ is the risk rate.

a) Show that the above model defines a regular exponential family with canonical
parameter $\theta = \log\lambda$;
b) Identify the canonical statistic and find its mean and variance;
c) Find the maximum likelihood estimate $\hat\lambda$ of $\lambda$;
d) Find the Fisher information for $\lambda$ and the variance of $\hat\lambda$.

An alternative estimator is the average rate

$$\tilde\lambda = \frac{1}{n}\sum_{i=1}^n \frac{X_i}{N_i}.$$

e) Show that $\tilde\lambda$ is an unbiased estimator of $\lambda$ and determine its variance;
f) Compare the variances of the estimators $\hat\lambda$ and $\tilde\lambda$ to the Cramér–Rao lower bound;
g) How are the findings above related to those in Exercise 4.3?


Exercise 4.6. Consider a sample $Z_1, \ldots, Z_n$ of independent and identically
distributed random variables, where $Z_i = X_i - Y_i$ with $X_i$ and $Y_i$ independent and
exponentially distributed with mean $\theta$.

a) Determine a moment estimator 6 of 6 based on the statistic i) =e 


b) Show that if we also had observed both of $X_i$ and $Y_i$, the MLE would be

$$\hat\theta = \frac{1}{2n}\sum_{i=1}^n (X_i + Y_i);$$

c) Compare the estimators $\tilde\theta$ and $\hat\theta$ via a suitable simulation study.
Exercise 4.7. Show that the estimator $\tilde\theta$ under b) in Exercise 4.4 is an MLE.
Exercise 4.8. Consider the family of gamma distributions with identical parameters
for shape and scale, i.e. with densities

$$f_\beta(x) = \frac{x^{\beta-1} e^{-x/\beta}}{\Gamma(\beta)\beta^\beta}, \quad \beta \in \mathbb{R}_+,$$

with respect to standard Lebesgue measure on $\mathbb{R}_+$ as in Exercise 3.7. Show that the
MLE of $\beta$ is unique and well-defined as a solution to the score equation. Hint: Use
that the trigamma function $\Psi_1(\beta) = D^2\log\Gamma(\beta)$ is strictly positive on $\mathbb{R}_+$.
Exercise 4.9. Let $X$ and $Y$ be independent random variables with $X$ Poisson
distributed with mean $\lambda$ and $Y$ exponentially distributed with rate $\lambda$, where $\lambda > 0$ as in
Exercise 3.8. Determine the MLE of $\lambda$ and identify when it is well-defined.


Exercise 4.10. Let $X$ and $Y$ be independent and exponentially distributed random
variables with $E(X) = \theta$ and $E(Y) = 1/\theta$ where $\theta > 0$ as in Exercise 3.9.
Determine the MLE of $\theta$ and identify when it is well-defined.
Chapter 5


Asymptotic Theory 


We consider a sample $(X_1, \ldots, X_n)$ of $n$ observations from a parametrized family
of distributions $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ and an associated estimator $\hat\theta_n$:

$$\hat\theta_n = h_n(X_1, \ldots, X_n)$$

where $h_n : \mathcal{X}^n \to \Theta$ is measurable.

The method of maximum likelihood is an almost universally applicable method 
that yields marvellously good estimators in a large number of cases. In most of these 
cases, however, the maximum-likelihood estimator (MLE) can only be computed 
numerically and its exact distributional properties are difficult to obtain. Therefore, 
it is important to have good and practical ways to assess the distributional properties 
approximately. 

It is common folklore that under suitable and not very strong regularity
conditions, it holds that for samples of sufficiently large size $n$, the MLE is approximately
normally distributed with the inverse Fisher information as approximate covariance:

$$\hat\theta_n \overset{\mathrm{as}}{\sim} N_k(\theta, i_n(\theta)^{-1}). \tag{5.1}$$

Here

$$i_n(\theta) = E_\theta(-D^2\ell_n(\theta; X_1, \ldots, X_n)) = V_\theta(S_n(X_1, \ldots, X_n; \theta)^\top)$$

where $S_n = D\ell_n$ is the score statistic and $D$ represents differentiation with respect
to the parameters $\theta \in \Theta \subseteq \mathbb{R}^k$.

If $X_1, \ldots, X_n$ are independent and identically distributed, it holds that
$i_n(\theta) = n\,i(\theta)$, where $i(\theta)$ is the Fisher information in a single observation.

In other words, the MLE is approximately an unbiased estimator and since its
approximate variance achieves the lower bound in the Cramér–Rao inequality, it is
generally difficult, if not impossible, to find an estimator of higher asymptotic
precision. This is usually expressed by saying that the MLE is asymptotically efficient. We
refrain from giving a precise definition of this concept and a corresponding proof.

Similarly, it is also common folklore that if $\Theta_0 \subseteq \Theta$ represents a sufficiently nice
hypothesis about the parameters of the distribution, the maximized-likelihood ratio
statistic $\Lambda_n$ satisfies

$$\Lambda_n = -2\log\frac{\sup_{\theta \in \Theta_0} L(\theta; X_1, \ldots, X_n)}{\sup_{\theta \in \Theta} L(\theta; X_1, \ldots, X_n)} \overset{\mathrm{as}}{\sim} \chi^2(m - d) \tag{5.2}$$

where $\chi^2(m - d)$ is the $\chi^2$-distribution with $m - d$ degrees of freedom, where $m$
is the number of free parameters needed to describe $\Theta$, and $d$ is the number of free
parameters needed to describe $\Theta_0$. If $\Theta_0 = \{\theta_0\}$ consists of a single point, we let
$d = 0$. In this chapter we shall rigorously establish the folklore as in (5.1) and in (5.2)
for repeated samples in a number of cases, including curved exponential families.
We begin by establishing some preliminary results and facts that we need for further
developments.

5.1 Asymptotic consistency and normality 


As we are interested in the behaviour of estimators when the sample size is
large, we shall consider a sequence of estimators $(\hat\theta_n)_{n \in \mathbb{N}}$ based on samples
$(X_1, \ldots, X_n)$, $n \in \mathbb{N}$. As mentioned earlier, it is convenient to allow the estimator
to hit points outside of $\Theta$, as in the following simple example.


Example 5.1. Consider a sample $X = (X_1, \ldots, X_n)$ from the simple Poisson model

$$f_\lambda(x) = \frac{\lambda^x}{x!} e^{-\lambda}, \quad x \in \mathbb{N}_0, \quad \lambda \in \Lambda = (0, \infty).$$

We have seen that a sensible estimator is

$$\hat\lambda_n = \bar X_n = (X_1 + \cdots + X_n)/n$$

but it may happen that $\hat\lambda_n = 0 \notin \Lambda$. However, this happens with very low probability
if $n$ is large and hence we should be able to ignore it for large $n$.
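The probability in question is explicit: $\hat\lambda_n = 0$ exactly when all observations are zero, which has probability $e^{-n\lambda}$. The short sketch below tabulates it; the value $\lambda = 0.5$ and the function name are arbitrary illustrative choices.

```python
import math


def prob_estimate_outside(lam, n):
    """P(all n i.i.d. Poisson(lam) observations are 0) = exp(-n * lam),
    i.e. the probability that the sample mean falls outside (0, infinity)."""
    return math.exp(-n * lam)


for n in (1, 10, 100):
    print(n, prob_estimate_outside(0.5, n))
```

The probability decays exponentially in $n$, which is why the event can be ignored asymptotically.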


This motivates the following definition: 


Definition 5.2. A sequence $\hat\theta_n = h_n(X_1, \ldots, X_n)$, $n \in \mathbb{N}$, of estimators is
asymptotically well-defined if there is a sequence $A_n \subseteq \mathcal{X}^n$ such that $h_n(A_n) \subseteq \Theta$ and

$$\lim_{n \to \infty} P_\theta\{(X_1, \ldots, X_n) \in A_n\} = 1$$

for all $\theta \in \Theta$.

Clearly, we would not only like that our sequence of estimators is well-defined
for large $n$, but also that it is close to the 'true' value of $\theta$ that has generated our
sample. This concept is called asymptotic consistency. More precisely, we define:

Definition 5.3. A sequence $\hat\theta_n = h_n(X_1, \ldots, X_n)$, $n \in \mathbb{N}$, of estimators is said to
be asymptotically consistent if it is asymptotically well-defined and it holds that

$$\operatorname*{plim}_{n \to \infty} \hat\theta_n = \theta$$

with respect to $P_\theta$ for all $\theta \in \Theta$.


We may often omit the prefix 'asymptotically' in front of consistent when it is
obvious from the context. In other words $\hat\theta_n$, $n \in \mathbb{N}$, is consistent if and only if for all
$\theta \in \Theta$ and $\varepsilon > 0$ it holds that

$$P_\theta\{\|\hat\theta_n - \theta\| > \varepsilon\} \to 0 \text{ for } n \to \infty.$$

Results about consistency of estimators are therefore based on variants of the Law of
Large Numbers.

Finally, we shall also be interested in quantifying how close the estimate is to the
value of $\theta$ generating the data. To do this, we need the following.


Definition 5.4. A sequence $\hat\theta_n = h_n(X_1, \ldots, X_n)$, $n \in \mathbb{N}$, of estimators is said to
be asymptotically normal with asymptotic mean $\theta$ and asymptotic variance $\Sigma(\theta)/n$,
if it is asymptotically well-defined and it holds that

$$\sqrt{n}(\hat\theta_n - \theta) \overset{D}{\to} N(0, \Sigma(\theta))$$

for all $\theta \in \Theta$.


Here $\overset{D}{\to}$ denotes convergence in distribution, see Section A.2.2 for further
details on that concept. Note in particular that an asymptotically normal estimator with
asymptotic mean and variance as above is also asymptotically consistent; see
Corollary A.17. As results on consistency were based on variants of the Law of Large
Numbers, results on asymptotic normality are based on variants of the Central Limit
Theorem.


5.2 Asymptotics of moment estimators 


In this section we show that under suitable regularity conditions, moment estimators 
are consistent and asymptotically normal. More precisely, we have 


Theorem 5.5. Let $X_1, \ldots, X_n$ be independent and identically distributed
observations from a parametrized family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}^k$ open. Let further
$t : \mathcal{X} \to \mathbb{R}^k$ be a statistic with moment function $m(\theta) = E_\theta\{t(X)\}$, where $m$ is
smooth and injective with $\det(Dm(\theta)) \neq 0$ for all $\theta \in \Theta$. Then the moment
estimator $\hat\theta_n$ is consistent.


Proof. We must first establish that the estimator is asymptotically well-defined. So
let $X = (X_1, \ldots, X_n)$ and $T_n = (t(X_1) + \cdots + t(X_n))/n$. The moment estimator
is the solution to the equation

$$m(\theta) = T_n.$$

Since $m$ is injective, there is at most one point satisfying the moment equation. A
solution exists if and only if $T_n$ is in the image of $m$. The law of large numbers
(Theorem A.9) ensures that

$$T_n \overset{P}{\to} m(\theta) \text{ for } n \to \infty$$

and thus for any neighbourhood $U$ of $m(\theta)$ it holds that

$$P_\theta\{T_n \in U\} \to 1 \text{ for } n \to \infty.$$

Now the inverse function theorem (Theorem A.20) ensures that $m(\Theta)$ is open, and
its inverse $m^{-1}$ is well-defined and smooth in $U$ so we conclude that $\hat\theta_n$ is
asymptotically well-defined. Since we have

$$\hat\theta_n = m^{-1}(T_n) \tag{5.3}$$

and $m^{-1}$ is continuous in $U$, we find that

$$\hat\theta_n = m^{-1}(T_n) \overset{P}{\to} m^{-1}(m(\theta)) = \theta$$

as desired.


We can now relatively easily establish conditions for the moment estimator to be
asymptotically normal. More precisely, we have


Theorem 5.6. Let $(X_1, \ldots, X_n)$ be independent and identically distributed
observations from the family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ with $\Theta \subseteq \mathbb{R}^k$ open. Let further $t : \mathcal{X} \to \mathbb{R}^k$
have moment function $m(\theta) = E_\theta\{t(X)\}$, where $m$ is smooth and injective with
$\det(Dm(\theta)) \neq 0$. Assume further that the moment statistic satisfies

$$E_\theta(\|t(X)\|_2^2) < \infty.$$

Then the moment estimator $\hat\theta_n$ is consistent and asymptotically normal

$$\hat\theta_n \overset{\mathrm{as}}{\sim} N_k(\theta, \Sigma(\theta)/n)$$

with asymptotic covariance

$$\Sigma(\theta)/n = Dm(\theta)^{-1} V_\theta(t(X)) Dm(\theta)^{-\top}/n.$$

Proof. From the Central Limit Theorem (Theorem A.16), we get directly that

$$T_n = (t(X_1) + \cdots + t(X_n))/n \overset{\mathrm{as}}{\sim} N_k(m(\theta), V_\theta(t(X))/n).$$

Now we have $\hat\theta_n = m^{-1}(T_n)$, and the inverse function theorem (Theorem A.20)
ensures that $m^{-1}$ is smooth with derivative satisfying

$$Dm^{-1}(m(\theta)) = Dm(\theta)^{-1}.$$

Thus the delta method (Theorem A.19) yields

$$\hat\theta_n \overset{\mathrm{as}}{\sim} N_k\left(m^{-1}(m(\theta)),\ Dm^{-1}(m(\theta)) V_\theta(t(X)) Dm^{-1}(m(\theta))^\top/n\right)
 = N_k\left(\theta,\ Dm(\theta)^{-1} V_\theta(t(X)) Dm(\theta)^{-\top}/n\right).$$

This concludes the proof.


Example 5.7. [Moment estimator asymptotics in simple normal model] In
Example 4.16 we found three moment estimators $\hat\xi_{in}$, $i = 1, 2, 3$ for an unknown mean
based on a sample $X_1, \ldots, X_n$ from a normal distribution $N(\xi, 1)$ with known
variance equal to one. These were based on the statistics

$$t_1(x) = x, \quad t_2(x) = 1_{(0,\infty)}(x), \quad t_3(x) = x^3$$

with corresponding moment functions

$$m_1(\xi) = \xi, \quad m_2(\xi) = P_\xi\{X > 0\} = \Phi(\xi), \quad m_3(\xi) = 3\xi + \xi^3.$$

Since $E_\xi(t_i(X)^2) < \infty$ for all these statistics and the moment functions are smooth
with non-zero derivatives

$$m_1'(\xi) = 1, \quad m_2'(\xi) = \frac{1}{\sqrt{2\pi}} e^{-\xi^2/2}, \quad m_3'(\xi) = 3(1 + \xi^2),$$

we conclude from Theorem 5.6 that the corresponding moment estimators are all
asymptotically normally distributed with the correct asymptotic mean $\xi$.
To calculate their asymptotic variance, we find the variance of the first two
statistics

$$V_\xi(t_1(X)) = 1, \quad V_\xi(t_2(X)) = \Phi(\xi)(1 - \Phi(\xi)),$$

where we have used that $1_{(0,\infty)}(X)$ follows a Bernoulli distribution with success
probability $\Phi(\xi)$. Further, for the third statistic, we get

$$V_\xi(t_3(X)) = E((Z + \xi)^6) - (E_\xi(X^3))^2
 = 15 + 45\xi^2 + 15\xi^4 + \xi^6 - (3\xi + \xi^3)^2 = 15 + 36\xi^2 + 9\xi^4.$$

It follows that the asymptotic variances are $\sigma_i^2(\xi)/n$, where

$$\sigma_1^2(\xi) = 1, \quad \sigma_2^2(\xi) = 2\pi e^{\xi^2}\Phi(\xi)(1 - \Phi(\xi)), \quad \sigma_3^2(\xi) = \frac{5 + 12\xi^2 + 3\xi^4}{3(1 + \xi^2)^2}.$$

We note in particular that the last two are always larger than $\sigma_1^2(\xi) = 1$ and that
$\sigma_2^2(\xi)$ may be huge for $|\xi|$ large. For $\xi = 2$ as used in the simulations behind Fig. 4.4
we get

$$\sigma_1^2(2) = 1, \quad \sigma_2^2(2) = 7.63, \quad \sigma_3^2(2) = 101/75 = 1.35$$

which conforms well with the plots, indicating that $\hat\xi_3$ in this case is only marginally
worse than $\hat\xi_1$, whereas $\hat\xi_2$ is very bad.
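The three asymptotic variances follow from Theorem 5.6 as $V_\xi(t_i(X))/m_i'(\xi)^2$ and are easy to evaluate numerically. The sketch below does this directly from the moment functions and variances of the example; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def asymptotic_variances(xi):
    """Return (sigma_1^2, sigma_2^2, sigma_3^2) = V(t_i(X)) / m_i'(xi)^2
    for the three statistics of Example 5.7 in the N(xi, 1) model."""
    cdf = NormalDist().cdf(xi)
    pdf = NormalDist().pdf(xi)                      # m_2'(xi)
    s1 = 1.0
    s2 = cdf * (1.0 - cdf) / pdf ** 2
    s3 = (15.0 + 36.0 * xi ** 2 + 9.0 * xi ** 4) / (3.0 * (1.0 + xi ** 2)) ** 2
    return s1, s2, s3


for xi in (0.0, 1.0, 2.0, 3.0):
    print(xi, asymptotic_variances(xi))
```

Evaluating over a range of $\xi$ shows how quickly the variance of the indicator-based estimator grows, while that of the cubic-based estimator stays bounded.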


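Example 5.8. [Gamma model: moment estimator asymptotics] For the moment
estimator in Example 4.17, we get

$$Dm(\alpha, \beta) = \begin{pmatrix} \beta & \alpha \\ (2\alpha + 1)\beta^2 & 2\alpha(\alpha + 1)\beta \end{pmatrix}$$

and thus

$$Dm(\alpha, \beta)^{-1} = \frac{1}{\alpha\beta^2}\begin{pmatrix} 2\alpha(\alpha + 1)\beta & -\alpha \\ -(2\alpha + 1)\beta^2 & \beta \end{pmatrix}.$$

Further we have

$$V_{\alpha,\beta}(t(X)) = \begin{pmatrix} \alpha\beta^2 & 2\alpha(\alpha + 1)\beta^3 \\ 2\alpha(\alpha + 1)\beta^3 & 2\alpha(\alpha + 1)(2\alpha + 3)\beta^4 \end{pmatrix}$$

yielding the asymptotic covariance of the moment estimator

$$V_{\alpha,\beta}\begin{pmatrix} \hat\alpha_{\mathrm{mom}} \\ \hat\beta_{\mathrm{mom}} \end{pmatrix}
 \approx \frac{1}{n} Dm(\alpha, \beta)^{-1} V_{\alpha,\beta}(t(X)) Dm(\alpha, \beta)^{-\top}
 = \frac{2(\alpha + 1)}{n}\begin{pmatrix} \alpha & -\beta \\ -\beta & \dfrac{(2\alpha + 3)\beta^2}{2\alpha(\alpha + 1)} \end{pmatrix}. \tag{5.4}$$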

There are some calculations involved in getting to this... 
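The matrix algebra can be checked numerically. The sketch below multiplies out $Dm^{-1} V Dm^{-\top}$ for the gamma model with $t(x) = (x, x^2)$ and compares it with the closed form $2(\alpha+1)\begin{pmatrix}\alpha & -\beta\\ -\beta & (2\alpha+3)\beta^2/(2\alpha(\alpha+1))\end{pmatrix}$, which is what a direct calculation gives. The helper names and the parameter values are illustrative assumptions.

```python
def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose2(a):
    """Transpose of a 2x2 matrix."""
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]


def gamma_moment_asymptotic_cov(alpha, beta):
    """Sigma = Dm^{-1} V Dm^{-T} for the gamma model with t(x) = (x, x^2)."""
    det = alpha * beta ** 2                       # determinant of Dm
    dm_inv = [[2 * alpha * (alpha + 1) * beta / det, -alpha / det],
              [-(2 * alpha + 1) * beta ** 2 / det, beta / det]]
    v = [[alpha * beta ** 2, 2 * alpha * (alpha + 1) * beta ** 3],
         [2 * alpha * (alpha + 1) * beta ** 3,
          2 * alpha * (alpha + 1) * (2 * alpha + 3) * beta ** 4]]
    return matmul2(matmul2(dm_inv, v), transpose2(dm_inv))


def closed_form(alpha, beta):
    """Closed-form expression for the same matrix."""
    c = 2 * (alpha + 1)
    return [[c * alpha, -c * beta],
            [-c * beta, (2 * alpha + 3) * beta ** 2 / alpha]]


print(gamma_moment_asymptotic_cov(3.0, 2.0))
print(closed_form(3.0, 2.0))
```

Dividing either matrix by $n$ gives the approximate covariance of the moment estimator for a sample of size $n$.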


5.3 Asymptotics in regular exponential families 

5.3.1 Asymptotic consistency of maximum likelihood 

In regular exponential families we can now easily establish the asymptotic properties
of the MLE using that in this case, MLEs are moment estimators. Indeed we have:

Theorem 5.9. Let $(X_1, \ldots, X_n)$ be a random sample from a minimal and regular
exponential family with canonical parameter $\theta \in \Theta$. Then the MLE $\hat\theta_n$ is
asymptotically consistent.


Proof. This follows from Theorem 5.5 since the MLE is a moment estimator, and
Theorem 3.17 states that $\tau(\theta)$ is injective and smooth and

$$D\tau(\theta) = D^2\psi(\theta) = \kappa(\theta) = V_\theta(t(X))$$

is positive definite and hence $\det(D\tau(\theta)) > 0$ for all $\theta \in \Theta$.


We should like to point out that the consistency of the MLE has only been shown 
for samples that are independent and identically distributed and it may fail, and badly 
so, if the situation is different, as the following example illustrates. 


Example 5.10. [Double measurements] Consider the determination of the precision of a measuring instrument as in Example 2.30 and recall that we had independent observations
$$X_i, Y_i \sim N(\xi_i, \sigma^2), \quad i = 1,\dots,n,$$
where $\xi_i \in \mathbb{R}$ is characteristic for the $i$th unit and $\sigma^2$ is the precision of the instrument. Since this is a linear normal model, we readily conclude that the MLE for $(\xi, \sigma^2)$ is determined as
$$\hat\xi_i = \frac{X_i + Y_i}{2}, \quad i = 1,\dots,n; \qquad \hat\sigma_n^2 = \frac{1}{2n}\sum_{i=1}^n\left\{(X_i - \hat\xi_i)^2 + (Y_i - \hat\xi_i)^2\right\} = \frac{1}{4n}\sum_{i=1}^n (X_i - Y_i)^2.$$
These estimates are not consistent: in fact, $\hat\xi_i$ will for all $i$ and $n$ have variance $\sigma^2/2$ and even the maximum likelihood estimate of $\sigma^2$ goes astray since we have that $V(X_i - Y_i) = 2\sigma^2$ and we thus get
$$\hat\sigma_n^2 = \frac{1}{4n}\sum_{i=1}^n (X_i - Y_i)^2 \xrightarrow{P} \frac{2\sigma^2}{4} = \frac{\sigma^2}{2},$$
which is only half of what it should be. This is one among other motivations for dividing the sum of squared deviations with the degrees of freedom (here $n$) instead of the number of observations (here $2n$) so as to achieve a consistent estimator.
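The inconsistency is easy to see in a simulation. The sketch below is illustrative only: the unit-specific means are drawn uniformly from an arbitrary interval (a hypothetical choice, since any values would do), and the function name is not from the text.

```python
import random


def simulate_double_measurements(n, sigma, seed=0):
    """Return (MLE of sigma^2, degrees-of-freedom corrected estimate)."""
    rng = random.Random(seed)
    sum_sq_diff = 0.0
    for _ in range(n):
        xi = rng.uniform(-10.0, 10.0)  # arbitrary unit-specific mean (assumption)
        x = rng.gauss(xi, sigma)
        y = rng.gauss(xi, sigma)
        sum_sq_diff += (x - y) ** 2
    mle = sum_sq_diff / (4 * n)        # divides by the number of observations 2n
    corrected = sum_sq_diff / (2 * n)  # divides by the degrees of freedom n
    return mle, corrected


mle, corrected = simulate_double_measurements(n=20000, sigma=2.0)
print("true sigma^2 = 4.0, MLE =", mle, ", corrected =", corrected)
```

With a large number of units the first value settles near $\sigma^2/2$ and the second near $\sigma^2$.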


5.3.2 Asymptotic normality of maximum likelihood

The MLE in a regular exponential family is also asymptotically normal and its asymptotic variance achieves the Cramér-Rao lower bound. More precisely, we have

Theorem 5.11. Let $(X_1,\dots,X_n)$ be a sample from a minimal and regular exponential family with canonical parameter $\theta \in \Theta$. Then the MLE $\hat\theta_n$ is asymptotically normally distributed with the inverse Fisher information as its asymptotic covariance matrix:
$$\hat\theta_n \overset{\mathrm{as}}{\sim} N_k(\theta, i_n(\theta)^{-1}) = N_k(\theta, \kappa(\theta)^{-1}/n).$$

Proof. Since the MLE is a moment estimator and $t(X)$ has moments of any order by Theorem 3.8, we can use Theorem 5.6 to conclude that the MLE is asymptotically normal. Recalling from Theorem 3.17 that
$$D\tau(\theta) = D^2\psi(\theta) = \kappa(\theta),$$
we find the asymptotic variance to be
$$\Sigma(\theta)/n = D\tau(\theta)^{-1}\kappa(\theta)D\tau(\theta)^{-\top}/n = \kappa(\theta)^{-1}\kappa(\theta)\kappa(\theta)^{-1}/n = i_n(\theta)^{-1},$$
where $i_n(\theta) = n\,i(\theta)$ is the Fisher information in a sample of size $n$. This concludes the proof.


Note that if we wish to use Theorem 5.11, it is a problem that we do not know $\theta$ and therefore cannot use the asymptotic variance $i_n(\theta)^{-1}$ to assess the accuracy of $\hat\theta_n$. The standard way out is to plug in $\hat\theta_n$ for the unknown value and write
$$\hat\theta_n \overset{\mathrm{as}}{\sim} N_k(\theta, i_n(\hat\theta_n)^{-1}).$$
This does not immediately make sense, as the right-hand side now is random, but it has the following formal meaning:

Corollary 5.12. Let $X_1,\dots,X_n$ be a random sample from a minimal and regular exponential family with canonical parameter $\theta \in \Theta$. It then holds that
$$\sqrt{n\,i(\hat\theta_n)}\,(\hat\theta_n - \theta) \xrightarrow{D} N_k(0, I_k) \quad \text{for } n \to \infty,$$
where $A = \sqrt{\Sigma}$ is the unique positive definite matrix $A$ satisfying $A^2 = \Sigma$.

Proof. Since $\hat\theta_n$ is consistent and $A \mapsto \sqrt{A}$ is continuous (also as a matrix function) it holds that $\sqrt{i(\hat\theta_n)} \xrightarrow{P} \sqrt{i(\theta)}$. Corollary A.12 in combination with Theorem 5.11 yields the result.

Note that for large $n$ the approximate variance of the MLE tends to zero at the rate of $n^{-1}$ and scaling the deviation from the true value by $\sqrt{n}$ yields convergence in distribution. This is not necessarily the case for non-exponential models as the next example shows.


Example 5.13. [MLE in uniform model] We found in Example 1.18 that the MLE of the unknown parameter $\theta$ in the uniform distribution on the interval $(0, \theta)$ was
$$\hat\theta_n = \max(X_1,\dots,X_n),$$
and in Example 4.2 we found that its variance was tending to zero at the speed of $n^{-2}$, i.e. much faster than in the regular case.

This is an example where the standard asymptotics fails and the MLE is not asymptotically normally distributed. Indeed, for large $n$ the distribution of the scaled deviation from the true value $n(\theta - \hat\theta_n)$ is approximately exponential; more precisely, it holds that
$$n(\theta - \hat\theta_n) \xrightarrow{D} \mathcal{E}_\theta \quad \text{for } n \to \infty,$$
where $\mathcal{E}_\theta$ is the exponential distribution with mean $\theta$. This may be seen in the following way. Take any $t > 0$; then we have
$$\lim_{n\to\infty} P_\theta\{n(\theta - \hat\theta_n) > t\} = \lim_{n\to\infty} P_\theta\left\{\hat\theta_n < \theta - \frac{t}{n}\right\} = \lim_{n\to\infty} \frac{1}{\theta^n}\left(\theta - \frac{t}{n}\right)^n = \lim_{n\to\infty}\left(1 - \frac{t}{n\theta}\right)^n = e^{-t/\theta},$$
which is the upper tail of the exponential distribution function.

Note that here we must scale the deviation with $n$ rather than $\sqrt{n}$ to obtain convergence in distribution. The approximate exponential shape of the distribution is also clearly visible in Fig. 4.1, even for $n = 10$.
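The limit can be illustrated by simulation. The following is a sketch with arbitrarily chosen values of $\theta$, $n$ and the number of replications; it compares the empirical mean and an empirical tail probability of $n(\theta - \hat\theta_n)$ with those of the exponential distribution with mean $\theta$.

```python
import math
import random


def scaled_deviations(theta, n, reps, seed=0):
    """Simulate n * (theta - max(X_1, ..., X_n)) for uniform samples on (0, theta)."""
    rng = random.Random(seed)
    deviations = []
    for _ in range(reps):
        mle = max(rng.uniform(0.0, theta) for _ in range(n))
        deviations.append(n * (theta - mle))
    return deviations


theta, n, t = 2.0, 50, 1.0
devs = scaled_deviations(theta, n, reps=5000)
mean_dev = sum(devs) / len(devs)
tail = sum(d > t for d in devs) / len(devs)

print("empirical mean:", mean_dev, "limit mean:", theta)
print("empirical P(dev > t):", tail, "limit tail:", math.exp(-t / theta))
```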


We note that an appropriate asymptotic result is valid also in any other smooth 
reparametrization of the exponential family. 


Corollary 5.14. Let $(X_1,\dots,X_n)$ be a random sample from a minimal and regular exponential family with canonical parameter $\theta \in \Theta$ and let $\lambda = \gamma(\theta)$ represent a reparametrization where $\gamma : \Theta \to \mathbb{R}^k$ is injective and smooth with a regular Jacobian. Then the MLE $\hat\lambda_n$ is asymptotically normally distributed with the inverse information as its asymptotic covariance:
$$\hat\lambda_n \overset{\mathrm{as}}{\sim} N_k(\lambda, i_n(\lambda)^{-1}) = N_k(\lambda, i(\lambda)^{-1}/n)$$
where $i(\lambda)$ is the Fisher information about $\lambda$ in a single observation.

Proof. We first note that if $\lambda = \gamma(\theta)$ we have from Theorem 1.30
$$i(\theta) = D\gamma(\theta)^\top i(\lambda)\, D\gamma(\theta),$$
and thus
$$i(\theta)^{-1} = (D\gamma(\theta))^{-1}\, i(\lambda)^{-1}\, (D\gamma(\theta)^\top)^{-1},$$
whence
$$i(\lambda)^{-1} = D\gamma(\theta)\, i(\theta)^{-1}\, D\gamma(\theta)^\top.$$
Using the delta method on $\hat\lambda_n = \gamma(\hat\theta_n)$ in combination with Theorem 5.11 now yields
$$\hat\lambda_n \overset{\mathrm{as}}{\sim} N_k(\lambda, D\gamma(\theta)\, i(\theta)^{-1} D\gamma(\theta)^\top/n) = N_k(\lambda, i(\lambda)^{-1}/n),$$
as required.


Remark 5.15. We shall later—in Theorem 5.29—establish this result for any curved 
exponential family. 

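Example 5.16. [The gamma family] Consider again the family of gamma distributions as in Examples 4.17 and 4.26. Recall that this may be represented as a minimal and regular exponential family of dimension 2, with canonical parameters
$$\theta = (\theta_1, \theta_2)^\top = (\alpha, 1/\beta)^\top,$$
canonical parameter space $\Theta = (0,\infty)^2$, canonical statistic $t(x) = (\log x, -x)^\top$, cumulant function
$$\psi(\theta) = \log c(\theta) = \log\Gamma(\theta_1) - \theta_1\log(\theta_2),$$
and base measure $\mu(dx) = x^{-1}\,dx$ on $(0,\infty)$. For the mean we differentiate the cumulant function
$$\tau(\theta) = E_\theta\begin{pmatrix}\log X\\ -X\end{pmatrix} = \begin{pmatrix}\Psi(\theta_1) - \log\theta_2\\ -\theta_1/\theta_2\end{pmatrix} = \begin{pmatrix}\Psi(\alpha) + \log\beta\\ -\alpha\beta\end{pmatrix},$$
where $\Psi(\alpha) = D\log\Gamma(\alpha)$ is known as the digamma function. For the covariance matrix, we get by further differentiation
$$V_\theta(t(X)) = \kappa(\theta) = \begin{pmatrix}\Psi_1(\theta_1) & -1/\theta_2\\ -1/\theta_2 & \theta_1/\theta_2^2\end{pmatrix} = \begin{pmatrix}\Psi_1(\alpha) & -\beta\\ -\beta & \alpha\beta^2\end{pmatrix} \tag{5.5}$$
where now $\Psi_1(\alpha) = D^2\log\Gamma(\alpha) = \Psi'(\alpha)$ is the trigamma function satisfying
$$\Psi_1(z) = \sum_{n=0}^\infty \frac{1}{(z+n)^2} \quad \text{for } z \in \mathbb{C}\setminus\{0,-1,-2,\dots\}.$$
The likelihood equations for $(\alpha,\beta)$ were obtained by equating the observed canonical statistics to their expectations as in (4.12). Expressing this in the canonical parameters, we get
$$\Psi(\hat\theta_1) - \log\hat\theta_2 = \frac{1}{n}\sum_{i=1}^n \log X_i, \qquad \frac{\hat\theta_1}{\hat\theta_2} = \bar X_n.$$
The asymptotic variance of this estimator is thus from (5.5)
$$i_n(\theta)^{-1} = \frac{1}{n}\kappa(\theta)^{-1} = \frac{1}{n(\theta_1\Psi_1(\theta_1) - 1)}\begin{pmatrix}\theta_1 & \theta_2\\ \theta_2 & \theta_2^2\Psi_1(\theta_1)\end{pmatrix}$$
which yields the full asymptotic distribution of $(\hat\theta_1, \hat\theta_2)$.
If we parametrize the gamma family in terms of $(\alpha,\beta)$, we have
$$\theta = \phi(\alpha,\beta) = (\alpha, 1/\beta)^\top$$
where $\phi : (0,\infty)^2 \to (0,\infty)^2$ is a smooth and injective homeomorphism with Jacobian
$$J(\alpha,\beta) = \begin{pmatrix}1 & 0\\ 0 & -1/\beta^2\end{pmatrix}$$
which has full rank 2 for all $\beta > 0$.
We can of course also determine the information matrix and the asymptotic distribution directly in terms of these. The information function is
$$I(x;\alpha,\beta) = -D^2\ell(\alpha,\beta;x) = \begin{pmatrix}\Psi_1(\alpha) & 1/\beta\\ 1/\beta & 2x/\beta^3 - \alpha/\beta^2\end{pmatrix}$$
and since $E_{\alpha,\beta}(X) = \alpha\beta$ we get
$$i(\alpha,\beta) = \begin{pmatrix}\Psi_1(\alpha) & 1/\beta\\ 1/\beta & \alpha/\beta^2\end{pmatrix}, \qquad i(\alpha,\beta)^{-1} = \frac{1}{\alpha\Psi_1(\alpha) - 1}\begin{pmatrix}\alpha & -\beta\\ -\beta & \beta^2\Psi_1(\alpha)\end{pmatrix},$$
yielding the asymptotic variance of $(\hat\alpha, \hat\beta)$ when dividing by the number of observations $n$. We may compare this with the asymptotic variance (5.4) of the moment estimator:
$$2(\alpha+1)\begin{pmatrix}\alpha & -\beta\\ -\beta & \dfrac{(2\alpha+3)\beta^2}{2\alpha(\alpha+1)}\end{pmatrix}.$$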


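This is larger. Since it holds approximately that
$$\Psi_1(\alpha) \approx \frac{1}{\alpha} + \frac{1}{2\alpha^2},$$
we get
$$\frac{1}{\alpha\Psi_1(\alpha) - 1} \approx 2\alpha < 2(\alpha+1)$$
and
$$\frac{\beta^2\Psi_1(\alpha)}{\alpha\Psi_1(\alpha) - 1} \approx \frac{(2\alpha+1)\beta^2}{\alpha} < \frac{(2\alpha+3)\beta^2}{\alpha}.$$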
It demands a little more calculation to see that the difference between the covariance matrices is actually positive semidefinite.
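The comparison of the two covariance matrices can be explored numerically. The sketch below is only illustrative: it implements the trigamma function by the recurrence $\Psi_1(x) = \Psi_1(x+1) + 1/x^2$ followed by a standard asymptotic series (an assumption of this sketch, since the Python standard library has no trigamma function), and it takes the moment covariance from (5.4) with the factor $1/n$ removed. In this check the diagonal entries of the difference are positive while its determinant vanishes up to rounding, which is what one expects if both estimators satisfy $\hat\alpha\hat\beta = \bar X_n$.

```python
import math


def trigamma(x):
    """Trigamma function for x > 0 (recurrence plus asymptotic series)."""
    value = 0.0
    while x < 6.0:
        value += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    value += inv + 0.5 * inv2 + inv * inv2 * (
        1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 * (1.0 / 42.0 - inv2 / 30.0)))
    return value


def mle_covariance(alpha, beta):
    """Inverse Fisher information i(alpha, beta)^{-1}."""
    c = alpha * trigamma(alpha) - 1.0
    return [[alpha / c, -beta / c],
            [-beta / c, beta ** 2 * trigamma(alpha) / c]]


def moment_covariance(alpha, beta):
    """Asymptotic covariance of the moment estimator as in (5.4), times n."""
    return [[2 * alpha * (alpha + 1), -2 * (alpha + 1) * beta],
            [-2 * (alpha + 1) * beta, (2 * alpha + 3) * beta ** 2 / alpha]]


def covariance_difference(alpha, beta):
    a, b = moment_covariance(alpha, beta), mle_covariance(alpha, beta)
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]


for alpha, beta in [(0.5, 2.0), (2.0, 1.0), (10.0, 0.3)]:
    d = covariance_difference(alpha, beta)
    det = d[0][0] * d[1][1] - d[0][1] * d[1][0]
    print(alpha, beta, d[0][0], d[1][1], det)
```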


5.3.3 Likelihood ratios and quadratic forms


As a direct consequence of Theorem 5.11 and Corollary 5.14 and basic properties 
of the multivariate normal distribution, we have a number of results concerning the 
asymptotic distribution of likelihood ratios and various quadratic forms. 

Theorem 5.17. Consider a random sample $Y_n = (X_1,\dots,X_n)$ from a regular and minimally represented exponential family with $\Theta \subseteq \mathbb{R}^k$. Then the log-likelihood ratio statistic $\Lambda_n = \Lambda_n(X_1,\dots,X_n,\theta)$ satisfies
$$\Lambda_n = 2\left(\ell_n(\hat\theta_n) - \ell_n(\theta)\right) \xrightarrow{D} \chi^2(k)$$
with respect to $P_\theta$. Here $\chi^2(k)$ is the $\chi^2$-distribution with $k$ degrees of freedom. Further, the log-likelihood ratio statistic is asymptotically equivalent to the statistics
$$\tilde W_n = n(\hat\theta_n - \theta)^\top i(\theta)(\hat\theta_n - \theta), \tag{5.6}$$
$$W_n = n(\hat\theta_n - \theta)^\top i(\hat\theta_n)(\hat\theta_n - \theta) \tag{5.7}$$
whereby these have the same asymptotic $\chi^2$-distribution.

Note that here and in the following, we shall generally suppress the dependence of $\Lambda_n$ on data and parameters to simplify the notation.


Proof. First note that if $\lambda = \gamma(\theta)$ is a smooth and injective reparametrization of the model, the log-likelihood ratio is invariant under reparametrization, since the likelihood itself is. We can therefore without loss of generality assume that we work in the canonical parametrization. We use Taylor's formula with remainder term in integral form (Theorem A.22) on the log-likelihood function $\ell_n(\theta)$ to obtain
$$\ell_n(\theta) = \ell_n(\hat\theta_n) + D\ell_n(\hat\theta_n)(\theta - \hat\theta_n) - \frac{n}{2}(\theta - \hat\theta_n)^\top K(\theta,\hat\theta_n)(\theta - \hat\theta_n),$$
with
$$K(\theta,\hat\theta_n) = -\frac{2}{n}\int_0^1 (1-t)\,D^2\ell_n(\hat\theta_n + t(\theta - \hat\theta_n))\,dt = 2\int_0^1 (1-t)\,\kappa(\hat\theta_n + t(\theta - \hat\theta_n))\,dt.$$
We also note that $K(\cdot,\cdot)$ is continuous and $K(\theta,\theta) = \kappa(\theta) = i(\theta)$. Since $\hat\theta_n$ satisfies the score equation, we have $D\ell_n(\hat\theta_n) = 0$. We thus get
$$\Lambda_n = 2(\ell_n(\hat\theta_n) - \ell_n(\theta)) = n(\hat\theta_n - \theta)^\top K(\theta,\hat\theta_n)(\hat\theta_n - \theta).$$
Now $\hat\theta_n$ is consistent so $\hat\theta_n \xrightarrow{P} \theta$ and by continuity we also have $K(\theta,\hat\theta_n) \xrightarrow{P} i(\theta)$. Since $\hat\theta_n \overset{\mathrm{as}}{\sim} N_k(\theta, i(\theta)^{-1}/n)$ we conclude that
$$\tilde W_n = n(\hat\theta_n - \theta)^\top i(\theta)(\hat\theta_n - \theta) \xrightarrow{D} \chi^2(k)$$
which is (5.6). The asymptotic equivalences now follow from Lemma A.18.

Remark 5.18. Most texts would use the Lagrange form of Taylor's theorem (Theorem A.23) in the above and similar proofs, leading to a remainder term with the quadratic form $-D^2\ell_n(\theta^*)$ instead of $K(\theta,\hat\theta_n)$, where $\theta^*$ is between $\theta$ and $\hat\theta_n$. Using the integral form of the remainder here and in subsequent proofs, we avoid discussing whether $\theta^*$ can be chosen as a measurable function of $x = (x_1,\dots,x_n)$.

We shall later (in Theorem 5.35) show a slightly more general version of Theorem 5.17, for curved exponential families. The proof is analogous, but we need to take more care, as the information function $I_n$ is more difficult to control.
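The asymptotic equivalence in Theorem 5.17 can be illustrated by simulation. The sketch below uses the exponential distribution with rate $\theta$ as a hypothetical one-dimensional example (it is not taken from the text): here $\ell_n(\theta) = n(\log\theta - \theta\bar x)$, $\hat\theta_n = 1/\bar x$ and $i(\theta) = 1/\theta^2$. It estimates how often the likelihood ratio statistic and the two quadratic forms in (5.6) and (5.7) exceed the 95% quantile of $\chi^2(1)$.

```python
import math
import random

CHI2_1_Q95 = 3.841458820694124  # 95% quantile of the chi-square distribution, 1 df


def lr_and_wald(sample, theta):
    """Likelihood ratio statistic and the quadratic forms (5.6) and (5.7)."""
    n = len(sample)
    xbar = sum(sample) / n
    theta_hat = 1.0 / xbar  # MLE of the rate
    lr = 2.0 * n * (math.log(theta_hat / theta) - 1.0 + theta * xbar)
    wald_true_info = n * (theta_hat - theta) ** 2 / theta ** 2    # uses i(theta)
    wald_plugin = n * (theta_hat - theta) ** 2 / theta_hat ** 2   # uses i(theta_hat)
    return lr, wald_true_info, wald_plugin


def rejection_fractions(theta, n, reps, seed=0):
    """Fraction of replications in which each statistic exceeds the quantile."""
    rng = random.Random(seed)
    counts = [0, 0, 0]
    for _ in range(reps):
        sample = [rng.expovariate(theta) for _ in range(n)]
        for j, value in enumerate(lr_and_wald(sample, theta)):
            counts[j] += value > CHI2_1_Q95
    return [c / reps for c in counts]


fractions = rejection_fractions(theta=1.5, n=200, reps=4000)
print("LR, (5.6), (5.7):", fractions)
```

All three fractions should be close to 0.05 for moderately large $n$, with the quadratic forms typically deviating slightly more than the likelihood ratio statistic.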


Remark 5.19. The statistic $W_n$ in (5.7) is known as the Wald statistic. We shall use the term Wald statistic for any statistic that is a quadratic form on the deviation of the estimator from the true value with the quadratic form being determined by the inverse of a consistent estimate of the asymptotic variance of the estimator, i.e. a statistic of the form
$$W_n = n(\hat\theta_n - \theta)^\top\, \hat\Sigma_n(\theta)^{-1}\, (\hat\theta_n - \theta) \tag{5.8}$$
where $\hat\theta_n \overset{\mathrm{as}}{\sim} N(\theta, \Sigma(\theta)/n)$ and $\operatorname{plim}_{n\to\infty} \hat\Sigma_n(\theta) = \Sigma(\theta)$. It follows from Lemma A.18 that any such statistic has the same asymptotic $\chi^2$-distribution.

Note also that whereas the Wald statistics are not equivariant under reparametrization, the quadratic score is. Indeed, if we are parametrizing the exponential family with the mean value parameter $\eta = \tau(\theta) = E_\theta\{t(X)\}$, we have $\hat\eta_n = \bar T_n$ and hence the Wald statistic in this parametrization is identical to the quadratic score. The asymptotic distribution of the quadratic score was already established in Corollary 1.35. We formulate this as its own corollary:

Corollary 5.20. Consider a sample $Y_n = (X_1,\dots,X_n)$ from a regular and minimally represented exponential family of dimension $k$ and let $\eta = \tau(\theta) = E_\theta\{t(X)\}$ be the corresponding mean value parameter. Then the Wald statistic and quadratic score for $\eta$ are identical and
$$W_n = Q_n = n(\hat\eta_n - \eta)^\top V_\eta(t(X))^{-1}(\hat\eta_n - \eta) \xrightarrow[n\to\infty]{D} \chi^2(k). \tag{5.9}$$


Remark 5.21. Note that since the log-likelihood ratio and quadratic score statistics 
are equivariant, it follows that these are also asymptotically equivalent in any smooth 
and injective reparametrization. 


The results may be further strengthened when submodels are considered. So we shall investigate the situation where $\mathcal{P}_0 = \{P_\theta, \theta \in \Theta_0\}$ is an affine submodel of $\mathcal{P}$; we may without loss of generality assume that $\Theta_0$ is represented as the image of an affine map as in (3.9), i.e. we have $\Theta_0 = AB + a$ where $A$ is an injective and affine map and $B$ is an open and convex subset of $\mathbb{R}^d$. We recall from Theorem 3.16 that $\mathcal{P}_0$ is again a regular exponential family with canonical statistic $\tilde t(x) = A^\top t(x)$ and the maximum-likelihood estimate under $\Theta_0$ is then determined as the moment estimator, i.e. as the unique solution to the equation
$$E_{A\beta+a}(A^\top t(X)) = A^\top\tau(A\beta + a) = A^\top \bar t(X).$$
Letting now $\hat\eta = \tau(\hat\theta)$ and $\hat{\hat\eta} = \tau(A\hat{\hat\beta} + a)$ denote the estimates of the mean value parameter in the two cases, we have $\hat\eta = \bar t(X)$ and thus
$$A^\top(\hat\eta - \hat{\hat\eta}) = 0. \tag{5.10}$$
To obtain a geometric interpretation of this equation, we note that by composite differentiation, we have
$$\frac{\partial\eta}{\partial\beta} = \frac{\partial\tau(A\beta + a)}{\partial\beta} = D\tau(A\beta + a)A = \kappa(A\beta + a)A = \Sigma_\beta A, \tag{5.11}$$
where we have let
$$\Sigma_\beta = V_{A\beta+a}(t(X)).$$
Hence, we have for all $\beta \in B$ that
$$A = \Sigma_\beta^{-1}\frac{\partial\eta}{\partial\beta}.$$
Letting $\beta = \hat{\hat\beta}$ we may rewrite equation (5.10) as
$$\left(\frac{\partial\eta}{\partial\beta}\Big|_{\beta=\hat{\hat\beta}}\right)^{\!\top} \Sigma_{\hat{\hat\beta}}^{-1}\,(\hat\eta - \hat{\hat\eta}) = 0,$$
i.e. the residual $\hat\eta - \hat{\hat\eta}$ is orthogonal to the space spanned by the columns of $\partial\eta/\partial\beta$ at $\beta = \hat{\hat\beta}$ with respect to the inner product determined by the inverse covariance $\Sigma_{\hat{\hat\beta}}^{-1}$ of the canonical statistic. We note that the space spanned by the partial derivatives is indeed the space of tangents to the curved surface $\tau(AB + a)$.

Letting now $H_\beta$ denote the hat-matrix for this projection, we have from Proposition 2.21 that
$$H_\beta = \frac{\partial\eta}{\partial\beta}\left(\frac{\partial\eta}{\partial\beta}^{\!\top}\Sigma_\beta^{-1}\frac{\partial\eta}{\partial\beta}\right)^{-1}\frac{\partial\eta}{\partial\beta}^{\!\top}\Sigma_\beta^{-1} = \Sigma_\beta A(A^\top\Sigma_\beta A)^{-1}A^\top \tag{5.12}$$
where we have used (5.11). This fact now leads to the following decomposition theorem in analogy with Theorem 2.24 for the multivariate normal distribution:

Theorem 5.22. Let $\mathcal{P}_0 = \{P_\theta \mid \theta \in \Theta_0\}$ be a $d$-dimensional regular exponential subfamily of a regular exponential family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ of dimension $m$ and let $X_1,\dots,X_n,\dots$ be independent and identically distributed according to some $P_{A\beta+a} \in \mathcal{P}_0$.

Further, let $\eta = \tau(A\beta + a)$, and let $\hat\eta_n = \bar T_n$ and $\hat{\hat\eta}_n$ denote the maximum likelihood estimates of $\eta$ under $\mathcal{P}$ and $\mathcal{P}_0$ respectively. If $H_\beta$ denotes the matrix for the projection onto the space spanned by the columns of $\partial\eta/\partial\beta$ with respect to the inner product determined by the inverse covariance $\Sigma_\beta^{-1}$, it holds that
$$\sqrt{n}\begin{pmatrix}\hat\eta_n - \hat{\hat\eta}_n\\ \hat{\hat\eta}_n - \eta\end{pmatrix} \overset{\mathrm{as}}{\sim} \sqrt{n}\begin{pmatrix}(I - H_\beta)(\bar T_n - \eta)\\ H_\beta(\bar T_n - \eta)\end{pmatrix} \xrightarrow{D} \begin{pmatrix}(I - H_\beta)Z\\ H_\beta Z\end{pmatrix}$$
where $Z \sim N_m(0, \Sigma_\beta)$. It holds in particular that $\hat\eta_n - \hat{\hat\eta}_n$ and $\hat{\hat\eta}_n$ are asymptotically independent and normally distributed on appropriate subspaces of $\mathbb{R}^m$ with concentrations $n\Sigma_\beta^{-1}$.

Proof. The statement about asymptotic equivalence follows directly from the delta method and the convergence in distribution from Corollary A.12. The last part of the conclusion follows from the normal decomposition theorem, Theorem 2.24.

Figure 5.1: Visualization of Theorem 5.22. The image under $\tau$ of the affine subspace $AB + a$ becomes a curved space, with tangent space (indicated by dashed lines) spanned by the columns of $\partial\eta/\partial\beta$. The residual $\hat\eta_n - \hat{\hat\eta}_n$ is orthogonal at $\hat{\hat\eta}_n$ to the tangent space with respect to the inner product determined by $\Sigma_{\hat{\hat\beta}}^{-1}$.


The situation is illustrated in Figure 5.1. We have a convenient corollary:

Corollary 5.23. Under the assumptions in Theorem 5.22, the Wald statistics satisfy
$$W_n = n(\hat\eta_n - \hat{\hat\eta}_n)^\top\, \hat{\hat\Sigma}_n^{-1}\,(\hat\eta_n - \hat{\hat\eta}_n) \xrightarrow{D} \chi^2(m - d), \qquad \tilde W_n = n(\hat\eta_n - \hat{\hat\eta}_n)^\top\, \hat\Sigma_n^{-1}\,(\hat\eta_n - \hat{\hat\eta}_n) \xrightarrow{D} \chi^2(m - d),$$
where $\hat{\hat\Sigma}_n = \kappa(A\hat{\hat\beta}_n + a)$ and $\hat\Sigma_n = \kappa(\hat\theta_n)$. Further, $W_n \overset{\mathrm{as}}{=} \tilde W_n$.

Proof. This follows from Theorem 5.22 in combination with Corollary A.12 and Lemma A.18, since the estimates of the variances are consistent.

Of these statistics, we would generally prefer $W_n$, since estimating the variance under the hypothesis yields a better estimate of the variance. However, we may occasionally wish to consider several submodels at the same time, for example when considering which regression coefficients are zero in a multiple regression with many explanatory variables. Then $\tilde W_n$ has the advantage that we do not have to calculate a new estimate of the covariance matrix for every submodel, but may use a single covariance estimate for all of them.

The likelihood ratio statistic at the submodel behaves in a way similar to the Wald statistics, as we shall now show.


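Theorem 5.24. Let $\mathcal{P}_0 = \{P_\theta \mid \theta \in \Theta_0\}$ be a $d$-dimensional regular exponential subfamily of a regular exponential family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ of dimension $m$ and let $X_1,\dots,X_n,\dots$ be independent and identically distributed according to some $P_{A\beta+a} \in \mathcal{P}_0$. Then the log-likelihood ratio statistic $\Lambda_n$ for the subfamily $\mathcal{P}_0$ satisfies
$$\Lambda_n = 2\left(\ell_n(\hat\theta_n) - \ell_n(\hat{\hat\theta}_n)\right) \xrightarrow{D} \chi^2(m - d)$$
with respect to any $P_\theta \in \mathcal{P}_0$.

Proof. As before, the proof is based on a Taylor expansion. We let $\hat{\hat\theta}_n = A\hat{\hat\beta}_n + a$ and have
$$\Lambda_n = 2\left(\ell_n(\hat\theta_n) - \ell_n(\hat{\hat\theta}_n)\right) = 2n\left\{(\hat\theta_n - \hat{\hat\theta}_n)^\top\hat\eta_n - \psi(\hat\theta_n) + \psi(\hat{\hat\theta}_n)\right\}.$$
Next we apply Taylor's formula with the remainder term in integral form (Theorem A.22) on $\psi$ to get
$$\psi(\hat{\hat\theta}_n) - \psi(\hat\theta_n) = \hat\eta_n^\top(\hat{\hat\theta}_n - \hat\theta_n) + \frac{1}{2}(\hat{\hat\theta}_n - \hat\theta_n)^\top K(\hat\theta_n, \hat{\hat\theta}_n)(\hat{\hat\theta}_n - \hat\theta_n),$$
where we have used that $\tau(\hat\theta_n) = \hat\eta_n$ and
$$K(\hat\theta_n, \hat{\hat\theta}_n) = 2\int_0^1 (1 - t)\,\kappa(\hat\theta_n + t(\hat{\hat\theta}_n - \hat\theta_n))\,dt.$$
Inserting this into the first expression for $\Lambda_n$ yields
$$\Lambda_n = n(\hat\theta_n - \hat{\hat\theta}_n)^\top K(\hat\theta_n, \hat{\hat\theta}_n)(\hat\theta_n - \hat{\hat\theta}_n).$$
Now since $\hat\theta_n = \tau^{-1}(\hat\eta_n)$, $\hat{\hat\theta}_n = \tau^{-1}(\hat{\hat\eta}_n)$ are both asymptotically normally distributed and $D\tau^{-1} = \kappa^{-1}$, the delta method (Theorem A.19) gives
$$\sqrt{n}(\hat\theta_n - \hat{\hat\theta}_n) \overset{\mathrm{as}}{=} \kappa(\theta)^{-1}\sqrt{n}(\hat\eta_n - \hat{\hat\eta}_n).$$
Hence,
$$\Lambda_n \overset{\mathrm{as}}{=} n(\hat\eta_n - \hat{\hat\eta}_n)^\top \kappa(\theta)^{-1}K(\hat\theta_n, \hat{\hat\theta}_n)\kappa(\theta)^{-1}(\hat\eta_n - \hat{\hat\eta}_n).$$
Since both of $\hat\theta_n$ and $\hat{\hat\theta}_n$ are consistent and $K$ is continuous with $K(\theta,\theta) = \kappa(\theta)$, we have
$$\kappa(\theta)^{-1}K(\hat\theta_n, \hat{\hat\theta}_n)\kappa(\theta)^{-1} \xrightarrow{P} \kappa(\theta)^{-1} = V_\theta(t(X))^{-1}.$$
Now Corollary 5.23 and Lemma A.18 imply
$$\Lambda_n \overset{\mathrm{as}}{=} n(\hat\eta_n - \hat{\hat\eta}_n)^\top \kappa(\theta)^{-1}(\hat\eta_n - \hat{\hat\eta}_n) \xrightarrow{D} \chi^2(m - d),$$
as desired.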


5.4 Asymptotics in curved exponential families 
5.4.1 Consistency of the maximum-likelihood estimator 


In the following we shall demonstrate that the asymptotic properties of the MLE, 
likelihood ratios, etc. essentially are the same for curved and linear exponential fam- 
ilies. The most difficult issue is to establish consistency of the MLE in the general 
curved case. This came essentially for free in the linear case, using the inverse func- 
tion theorem and the fact that the MLE is a moment estimator; the latter may not 
hold in the general curved case. 

Here we need to work harder to establish that the MLE is asymptotically well- 
defined and consistent. Once this has been established, the implicit function theorem 
in combination with the delta method yields relatively easily that the estimator is 
asymptotically normally distributed. 

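Note first that the log-likelihood function for such a family based on a sample $(X_1,\dots,X_n)$ has the form
$$\ell_n(\beta) = n\left\{\phi(\beta)^\top \bar T_n - \psi(\phi(\beta))\right\}.$$
We let
$$\lambda(\eta,\beta) = \phi(\beta)^\top\eta - \psi(\phi(\beta))$$
and note that then $\ell_n(\beta) = n\lambda(\bar T_n, \beta)$. We now have the following crucial lemma:

Lemma 5.25. For a curved exponential family as defined above, there exists an open set $O$ with $\tau(\phi(B)) \subseteq O \subseteq \tau(\Theta) \subseteq \mathbb{R}^k$ and a smooth function $g : O \to B$ such that
$$\lambda(\eta, g(\eta)) > \lambda(\eta, \beta) \quad \text{for all } \eta \in O,\ \beta \in B\setminus\{g(\eta)\}. \tag{5.13}$$
Further, $g$ satisfies the equation
$$g(\tau(\phi(\beta))) = \beta \quad \text{for all } \beta \in B \tag{5.14}$$
and has derivative
$$Dg(\eta) = i(g(\eta))^{-1}J(g(\eta))^\top. \tag{5.15}$$
In particular, if $\bar T_n \in O$, the likelihood function attains its unique maximum over $B$ at $\hat\beta_n = g(\bar T_n)$.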


Proof. The proof is somewhat technical and deferred to Appendix B.2.

In most of the literature, the first part of this lemma is stated as an assumption rather than a fact. This is essentially what is needed in practice, as the lemma gives no guidance in a specific case, where we typically need to establish (5.13) by other means. It just guarantees that our efforts in doing so will not be in vain for large enough samples.

The relation (5.14) is usually referred to as Fisher consistency. We can now show that the MLE is asymptotically consistent in a curved exponential family.

Theorem 5.26. Let $X_1,\dots,X_n$ be a sample from a curved exponential family with parameter space $B$. Then the maximum likelihood estimator $\hat\beta_n$ is asymptotically well-defined as a solution to the score equation
$$S(X_1,\dots,X_n;\beta)^\top = nS(\bar T_n,\beta)^\top = nJ(\beta)^\top\left(\bar T_n - \tau(\phi(\beta))\right) = 0. \tag{5.16}$$
Further, the MLE is asymptotically consistent and Fisher consistent so $\hat\beta_n = g(\bar T_n)$, where $g(\tau(\phi(\beta))) = \beta$.

Proof. Since $O$ in Lemma 5.25 is open, there is a neighbourhood $U$ with
$$\tau(\phi(\beta)) \in U \subseteq O.$$
The law of large numbers ensures that $\bar T_n \xrightarrow{P} \tau(\phi(\beta))$ and thus
$$\lim_{n\to\infty} P_\beta(\bar T_n \in U) = 1.$$
But for $\bar T_n \in U$, Lemma 5.25 ensures that $\hat\beta_n$ is well-defined, maximizes the likelihood function, and solves the score equation (5.16). Thus the MLE is asymptotically well-defined and asymptotically consistent. The Fisher consistency is (5.14).

The MLE is well-defined with a probability that tends to one when $n$ becomes large. However, in any specific case one must verify whether the likelihood function actually has a unique maximum for the actual observed value $\bar T_n = \bar t_n$, i.e. check whether in fact it holds that $\bar t_n \in O$.


Example 5.27. [Continuation of Example 3.28] Consider the case when the mean of a bivariate normal distribution is assumed to be located on a semi-circle in the right half-plane. The score function is
$$S_n(\beta) = nJ(\beta)^\top(\bar t_n - \tau(\phi(\beta))) = n(-\sin\beta, \cos\beta)\begin{pmatrix}\bar x_1 - \cos\beta\\ \bar x_2 - \sin\beta\end{pmatrix} = n(\bar x_2\cos\beta - \bar x_1\sin\beta).$$
For $\bar x_1 = \bar x_2 = 0$ any $\beta$ in the interval $B = (-\pi/2, \pi/2)$ satisfies the score equation, and for $\bar x_1 = 0 \neq \bar x_2$ the score equation does not have a solution in this interval. If $\bar x_1 \neq 0$, the score equation has a unique solution $\beta^*$
$$\beta^* = \tan^{-1}(\bar x_2/\bar x_1)$$
since $\tan(\beta) = \sin\beta/\cos\beta$ is a bijection from $B$ to $\mathbb{R}$.

Figure 5.2: The curved subfamily in Example 5.27 determined by the mean being on a semi-circle. The picture to the left shows a situation where the MLE does not exist and the solution to the score equation maximizes the distance to the curve, hence minimizes the likelihood. In the right-most picture, the MLE exists and is determined as the point of intersection of the line to $\bar x$ and the semicircle. Note that the residual $\bar t_n - \hat\eta_n$ is orthogonal to the tangent of the circle at the MLE.


Note that $\beta^*$ may be interpreted as the angle (measured with sign) between the abscissa and the vector from $(0,0)$ to the observation $(\bar x_1, \bar x_2)$. This implies that we have
$$\cos\beta^* = |\bar x_1|/R, \qquad \sin\beta^* = \operatorname{sgn}(\bar x_1)\bar x_2/R, \tag{5.17}$$
where $R = \|\bar x\| = \sqrt{\bar x_1^2 + \bar x_2^2}$ is the length of the observation. Differentiating a second time and changing sign yields the information function
$$I_n(\beta) = -S_n'(\beta) = n(\bar x_1\cos\beta + \bar x_2\sin\beta)$$
so if $\beta^*$ solves the score equation we get
$$I_n(\beta^*) = n\left(\bar x_1\frac{|\bar x_1|}{R} + \bar x_2\frac{\operatorname{sgn}(\bar x_1)\bar x_2}{R}\right) = n\operatorname{sgn}(\bar x_1)\frac{\bar x_1^2 + \bar x_2^2}{R} = n\operatorname{sgn}(\bar x_1)R$$
and hence the solution only corresponds to a maximum if $\bar x_1 > 0$, i.e. if the observed canonical statistic $(\bar x_1, \bar x_2)$ is in the right half-plane. If $\bar x_1 < 0$, the solution is a minimum, so the MLE is not well-defined. These situations are illustrated in Figure 5.2.


Thus, in this example the set $O$ in Lemma 5.25 is the right half plane $(0,\infty)\times\mathbb{R}$ and the MLE is only well-defined if the observed value of $\bar x$ is located there. Note that this will happen with high probability if $n$ is sufficiently large, although this probability depends drastically on $\beta$. For $\beta$ close to either of $-\pi/2$ or $\pi/2$, $n$ has to be massive to ensure this happens.
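The dependence on $\beta$ can be made concrete with a short computation. The sketch below assumes that the bivariate normal distribution of Example 3.28 has identity covariance, so that $\bar X_1 \sim N(\cos\beta, 1/n)$; this assumption and the function names are not taken from the text.

```python
import math


def stationary_point(x1_bar, x2_bar):
    """Solution beta* = atan(x2_bar / x1_bar) of the score equation (x1_bar != 0)."""
    return math.atan(x2_bar / x1_bar)


def information(beta, x1_bar, x2_bar, n=1):
    """Information function I_n(beta) = n (x1_bar cos(beta) + x2_bar sin(beta))."""
    return n * (x1_bar * math.cos(beta) + x2_bar * math.sin(beta))


def semicircle_mle(x1_bar, x2_bar):
    """MLE of beta, or None when the observation is not in the right half-plane."""
    if x1_bar <= 0.0:
        return None
    return stationary_point(x1_bar, x2_bar)


def prob_mle_well_defined(beta, n):
    """P(X1_bar > 0) = Phi(sqrt(n) cos(beta)), assuming identity covariance."""
    z = math.sqrt(n) * math.cos(beta)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


print(semicircle_mle(1.0, 1.0), semicircle_mle(-1.0, 0.5))
for beta in (0.0, 1.0, 1.5):
    print(beta, [prob_mle_well_defined(beta, n) for n in (10, 100, 1000)])
```

For $\beta$ near $\pm\pi/2$ the factor $\cos\beta$ is small, so the probability approaches one only slowly as $n$ grows.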
Example 5.28. [Continuation of Example 3.29] Consider a sample of independent and identically distributed random variables $X_1,\dots,X_n$ with $X_i \sim N(\beta, \beta^2)$, where $\beta \in B = (0,\infty)$ is unknown. The log-likelihood function becomes
$$\ell_n(\beta) = -\frac{SS_n}{2\beta^2} + \frac{S_n}{\beta} - n\log\beta,$$
where $S_n = \sum_{i=1}^n X_i$ and $SS_n = \sum_{i=1}^n X_i^2$ are the sum and the sum of squares of the observations, corresponding to the canonical statistics in the larger family of all normal distributions. So here
$$\bar T_n = (\bar T_{1n}, \bar T_{2n})^\top = (S_n/n, SS_n/n)^\top$$
and the score statistic becomes
$$S_n(\bar T_n, \beta) = n\left(\frac{\bar T_{2n}}{\beta^3} - \frac{\bar T_{1n}}{\beta^2} - \frac{1}{\beta}\right).$$
Hence, the score equation is equivalent to the equation
$$\beta^2 + \bar T_{1n}\beta - \bar T_{2n} = 0$$
with roots
$$\beta = \frac{-\bar T_{1n} \pm \sqrt{\bar T_{1n}^2 + 4\bar T_{2n}}}{2}$$
of which exactly one (corresponding to the plus sign) is positive unless it holds that $X_1 = X_2 = \cdots = X_n = 0$, which happens with probability 0. The information function is
$$I_n(\beta) = \frac{3SS_n}{\beta^4} - \frac{2S_n}{\beta^3} - \frac{n}{\beta^2}.$$
As there is only one stationary point, $\hat\beta_n$ is a global maximum if and only if the observed information $I_n(\hat\beta_n) > 0$. But if $\hat\beta_n$ is a solution to the score equation, we have $SS_n = S_n\hat\beta_n + n\hat\beta_n^2$ and therefore the observed information $I_n(\hat\beta_n)$ at the unique solution to the score equation satisfies
$$\hat\beta_n^4 I_n(\hat\beta_n) = 3SS_n - 2S_n\hat\beta_n - n\hat\beta_n^2 = 3SS_n - 2SS_n + n\hat\beta_n^2 = SS_n + n\hat\beta_n^2 > 0.$$
We thus conclude that the solution to the score equation is the MLE so
$$\hat\beta_n = \frac{-\bar T_{1n} + \sqrt{\bar T_{1n}^2 + 4\bar T_{2n}}}{2}. \tag{5.18}$$


In other words, we have in this example that $O = \{(x,y) \in \mathbb{R}^2 \mid y > 0\}$ and $P_\beta(\bar T_n \in O) = 1$, so the MLE is well-defined almost surely.
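The closed-form solution can be checked directly against the score and the observed information. This is a minimal sketch; the observed values $\bar t_n = (0.5, 3)$ and $n = 10$ are illustrative inputs and the function names are hypothetical.

```python
import math


def cv_model_mle(t1_bar, t2_bar):
    """Positive root of beta^2 + t1_bar * beta - t2_bar = 0, as in (5.18)."""
    return 0.5 * (-t1_bar + math.sqrt(t1_bar ** 2 + 4.0 * t2_bar))


def score(beta, t1_bar, t2_bar, n):
    """Score function n (t2_bar / beta^3 - t1_bar / beta^2 - 1 / beta)."""
    return n * (t2_bar / beta ** 3 - t1_bar / beta ** 2 - 1.0 / beta)


def observed_information(beta, t1_bar, t2_bar, n):
    """I_n(beta) = 3 SS_n / beta^4 - 2 S_n / beta^3 - n / beta^2."""
    return n * (3.0 * t2_bar / beta ** 4 - 2.0 * t1_bar / beta ** 3
                - 1.0 / beta ** 2)


t1_bar, t2_bar, n = 0.5, 3.0, 10
beta_hat = cv_model_mle(t1_bar, t2_bar)
print("MLE:", beta_hat)
print("score at MLE:", score(beta_hat, t1_bar, t2_bar, n))
print("observed information at MLE:",
      observed_information(beta_hat, t1_bar, t2_bar, n))
```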
5.4.2 Asymptotic normality in curved exponential families

To establish asymptotic normality of the MLE, we shall again use the delta method. For this, we exploit that we found the derivatives of $\hat\beta_n = g(\bar T_n)$ in (5.15).

Theorem 5.29. Let $(X_1,\dots,X_n)$ be a sample from a curved exponential family of dimension $m$ with parameter $\beta \in B$. Then the MLE $\hat\beta_n$ is asymptotically normally distributed with the inverse information as its asymptotic covariance matrix:
$$\hat\beta_n \overset{\mathrm{as}}{\sim} N_m(\beta, i_n(\beta)^{-1}) = N_m\left(\beta, \frac{1}{n}i(\beta)^{-1}\right)$$
where $i(\beta)$ is the Fisher information about $\beta$ for a single observation.

Proof. The central limit theorem implies that $\bar T_n \overset{\mathrm{as}}{\sim} N_k(\tau(\phi(\beta)), \kappa(\phi(\beta))/n)$. Since we have established in Theorem 5.26 that $\hat\beta_n = g(\bar T_n)$, where $g$ is smooth, we can use the delta method (Theorem A.19) to conclude that
$$\hat\beta_n \overset{\mathrm{as}}{\sim} N_m\left(g(\tau(\phi(\beta))), \frac{1}{n}Dg(\tau(\phi(\beta)))\,\kappa(\phi(\beta))\,Dg(\tau(\phi(\beta)))^\top\right)$$
so if we let $\eta = \tau(\phi(\beta))$, we get by Fisher consistency that $g(\eta) = \beta$, which yields the asymptotic mean. Using (5.15) for the derivative $Dg$, we get for the asymptotic covariance with $\eta = \tau(\phi(\beta))$ and thus $g(\eta) = \beta$
$$Dg(\eta)\kappa(\phi(\beta))Dg(\eta)^\top = i(g(\eta))^{-1}J(g(\eta))^\top\kappa(\phi(\beta))J(g(\eta))\,i(g(\eta))^{-1} = i(\beta)^{-1}i(\beta)\,i(\beta)^{-1} = i(\beta)^{-1},$$
where we have used that $i(\beta) = J(\beta)^\top\kappa(\phi(\beta))J(\beta)$, and the result follows.


For illustration, we again consider the bivariate normal distribution with mean on 
a semi-circle. 


Example 5.30. [Continuation of Example 5.27] Consider again the case when the mean of a bivariate normal distribution is assumed to be located on a semi-circle in the right half-plane. We found the information function to be
$$I_n(\beta) = n(\bar X_1\cos\beta + \bar X_2\sin\beta)$$
yielding the Fisher information
$$i_n(\beta) = E_\beta(I_n(\beta)) = n(\cos^2\beta + \sin^2\beta) = n$$
so we have
$$\hat\beta_n = \tan^{-1}(\bar X_2/\bar X_1) \overset{\mathrm{as}}{\sim} N(\beta, 1/n).$$
Note in particular that the asymptotic variance is constant in $\beta$.

Example 5.31. [Continuation of Example 5.28] For the model of constant coefficient of variation in the normal distribution, we had
$$I_n(\beta) = \frac{3SS_n}{\beta^4} - \frac{2S_n}{\beta^3} - \frac{n}{\beta^2}$$
and by taking expectations, using $E_\beta(SS_n) = 2n\beta^2$ and $E_\beta(S_n) = n\beta$, we get
$$i_n(\beta) = E_\beta(I_n(\beta)) = \frac{6n}{\beta^2} - \frac{2n}{\beta^2} - \frac{n}{\beta^2} = \frac{3n}{\beta^2}$$
so we conclude that $\hat\beta_n \overset{\mathrm{as}}{\sim} N(\beta, \beta^2/(3n))$. This could, of course, have been derived directly, using the delta method on expression (5.18); but the point is that we can avoid this work by using Theorem 5.29.
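The asymptotic variance $\beta^2/(3n)$ can be compared with a Monte Carlo estimate. The following sketch uses arbitrarily chosen values of $\beta$, $n$ and the number of replications, and computes the MLE from the closed form (5.18).

```python
import math
import random


def mle(sample):
    """MLE of beta in the N(beta, beta^2) model via the closed form (5.18)."""
    n = len(sample)
    t1 = sum(sample) / n
    t2 = sum(x * x for x in sample) / n
    return 0.5 * (-t1 + math.sqrt(t1 * t1 + 4.0 * t2))


def simulate(beta, n, reps, seed=0):
    """Return the Monte Carlo mean and variance of the MLE."""
    rng = random.Random(seed)
    estimates = [mle([rng.gauss(beta, beta) for _ in range(n)])
                 for _ in range(reps)]
    mean = sum(estimates) / reps
    var = sum((b - mean) ** 2 for b in estimates) / (reps - 1)
    return mean, var


beta, n = 2.0, 200
mean, var = simulate(beta, n, reps=3000)
print("mean of MLE:", mean)
print("n * variance:", n * var, "asymptotic value:", beta ** 2 / 3.0)
```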


5.4.3 Geometric interpretation of the score equation 


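As for the case of an affine subfamily, it is worth interpreting the score equation geometrically. By composite differentiation, we have
$$\frac{\partial\tau(\phi(\beta))}{\partial\beta} = \kappa(\phi(\beta))J(\beta)$$
and thus
$$J(\beta) = \kappa(\phi(\beta))^{-1}\frac{\partial\tau(\phi(\beta))}{\partial\beta} = \Sigma_\beta^{-1}\frac{\partial\tau(\phi(\beta))}{\partial\beta}$$
where $\Sigma_\beta = V_{\phi(\beta)}\{t(X)\}$. Hence we can rewrite the score equation for $\hat\eta_n = \tau(\phi(\hat\beta_n))$ as
$$J(\hat\beta_n)^\top(\bar t_n - \hat\eta_n) = 0 \iff \left(\frac{\partial\tau(\phi(\beta))}{\partial\beta}\Big|_{\beta=\hat\beta_n}\right)^{\!\top}\Sigma_{\hat\beta_n}^{-1}(\bar t_n - \hat\eta_n) = 0 \tag{5.19}$$
which expresses that the residual $\bar t_n - \hat\eta_n$ is orthogonal to the tangent space at $\hat\eta_n$ with respect to the inner product determined by $\Sigma_{\hat\beta_n}^{-1}$.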


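The tangent space $T(\beta)$ is the affine space around $\tau(\phi(\beta))$ spanned by the vectors
$$\tau_j(\phi(\beta)) = \frac{\partial\tau(\phi(\beta))}{\partial\beta_j}, \quad j = 1,\dots,m,$$
i.e. all vectors of the form
$$T(v,\beta) = \tau(\phi(\beta)) + \sum_j v_j\tau_j(\phi(\beta)), \quad v \in \mathbb{R}^m.$$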


For later use we also consider the function $h : O \to \tau(\Theta)$, where $O$ is the open set in Lemma 5.25 and
$$h(\eta) = \tau(\phi(g(\eta))).$$
Note that then
$$\hat\eta_n = \tau(\phi(\hat\beta_n)) = h(\bar t_n)$$
is the MLE of $\tau(\phi(\beta)) = E_{\phi(\beta)}(t(X))$. By composite differentiation, we find
$$D\tau(\phi(\beta)) = \frac{\partial\tau(\phi(\beta))}{\partial\beta} = \kappa(\phi(\beta))J(\beta) \tag{5.20}$$
and further, using (5.15):
$$Dh(\eta) = \kappa(\phi(g(\eta)))\,J(g(\eta))\,i(g(\eta))^{-1}J(g(\eta))^\top = H(\eta). \tag{5.21}$$
The matrix $H(\eta)$ is often referred to as the hat-matrix, because of the following:


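Lemma 5.32. For $\eta = \tau(\phi(\beta))$, the hat-matrix $H(\eta) = Dh(\eta)$ represents the orthogonal projection onto $T^*(\beta) = T(\beta) - \eta$ with respect to the inner product determined by $\kappa(\phi(\beta))^{-1} = \Sigma_\beta^{-1}$.

Proof. We shall write $H$ for $H(\eta)$, $J$ for $J(g(\eta))$, etc., and let $\Sigma = \kappa$. Now we use Proposition 2.21 with $A = \Sigma J$ by (5.20), $K = \Sigma^{-1}$, and $i = J^\top\Sigma J$ to get
$$H = \Sigma J\,i^{-1}J^\top = \Sigma J(J^\top\Sigma J)^{-1}J^\top = A(A^\top KA)^{-1}J^\top\Sigma K = A(A^\top KA)^{-1}(\Sigma J)^\top K = A(A^\top KA)^{-1}A^\top K$$
so $H$ is indeed the matrix for the orthogonal projection.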


We shall illustrate this geometric result in the model of fixed coefficient of vari- 
ation. 


Example 5.33. [Continuation of Example 5.28] Assume that we have observed $\bar t_n = (0.5, 3)^\top$ based on $n = 10$ observations leading by (5.18) to the MLE
$$\hat\beta_n = \frac{-0.5 + \sqrt{0.25 + 12}}{2} = 1.5.$$
The estimation is illustrated in Fig. 5.3. The picture does not immediately show orthogonality, but the inner product is also not the standard Euclidean one. The covariance of the canonical statistic on a point of the curve is
$$\Sigma_\beta = V_\beta\begin{pmatrix}X\\ X^2\end{pmatrix} = \begin{pmatrix}\beta^2 & 2\beta^3\\ 2\beta^3 & 6\beta^4\end{pmatrix}.$$
So the estimated precision at $\hat\eta_n = (1.5, 4.5)^\top$ is
$$\Sigma_{\hat\beta}^{-1} = \begin{pmatrix}\hat\beta^2 & 2\hat\beta^3\\ 2\hat\beta^3 & 6\hat\beta^4\end{pmatrix}^{-1} = \frac{1}{2\hat\beta^4}\begin{pmatrix}6\hat\beta^2 & -2\hat\beta\\ -2\hat\beta & 1\end{pmatrix} = \frac{1}{81}\begin{pmatrix}108 & -24\\ -24 & 8\end{pmatrix}.$$
The tangent direction is obtained by differentiating $(\beta, 2\beta^2)^\top$ to yield $(1, 4\hat\beta)^\top = (1, 6)^\top$ so the inner product between the residual and the tangent direction is
$$\frac{1}{81}(-1, -1.5)\begin{pmatrix}108 & -24\\ -24 & 8\end{pmatrix}\begin{pmatrix}1\\ 6\end{pmatrix} = (-108 + 36 + 144 - 72)/81 = 0$$
as it should be.

Figure 5.3: Geometric illustration of maximum-likelihood estimation in the model with constant coefficient of variation. We have $\tau(\phi(\beta)) = (\beta, 2\beta^2)^\top$ and assume $\bar t_n = (0.5, 3)^\top$, leading to $\hat\beta = 1.5$. The tangent to the curve at the MLE is indicated with a thick line. The residual is orthogonal to the tangent with respect to the inner product determined by the inverse covariance.
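The orthogonality in this example is easy to confirm numerically. The sketch below recomputes the precision matrix, the residual and the tangent direction for $\bar t_n = (0.5, 3)$ and $\hat\beta = 1.5$, and contrasts the inner product determined by the inverse covariance with the ordinary Euclidean one; the helper names are illustrative.

```python
def covariance(beta):
    """Covariance of (X, X^2) when X ~ N(beta, beta^2)."""
    return [[beta ** 2, 2 * beta ** 3], [2 * beta ** 3, 6 * beta ** 4]]


def inverse2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def inner_product(u, v, precision):
    """u^T * precision * v for 2-vectors u and v."""
    return sum(u[i] * precision[i][j] * v[j]
               for i in range(2) for j in range(2))


beta_hat = 1.5
t_bar = (0.5, 3.0)
eta_hat = (beta_hat, 2 * beta_hat ** 2)            # point on the curve
residual = (t_bar[0] - eta_hat[0], t_bar[1] - eta_hat[1])
tangent = (1.0, 4 * beta_hat)                      # derivative of (beta, 2 beta^2)
precision = inverse2(covariance(beta_hat))

weighted = inner_product(residual, tangent, precision)
euclidean = residual[0] * tangent[0] + residual[1] * tangent[1]
print("precision * 81:", [[81 * p for p in row] for row in precision])
print("weighted inner product:", weighted)
print("Euclidean inner product:", euclidean)
```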


5.4.4 Likelihood ratios and quadratic forms 
Based on Lemma 5.32, we obtain the following for the joint distribution of the resid- 
ual and estimate: 


Corollary 5.34. Let $(X_1,\dots,X_n)$ be a sample from a curved exponential family of dimension $m$ and order $k$ with parameter $\beta \in B$. Then the Wald statistic $W_n$ is the squared norm of the residual with respect to the inner product determined by $\kappa(\phi(\beta))^{-1} = \Sigma_\beta^{-1}$:
$$W_n = n(\bar T_n - \hat\eta_n)^\top\kappa(\phi(\beta))^{-1}(\bar T_n - \hat\eta_n).$$
Also, $W_n$ is asymptotically independent of $\hat\beta_n$ and $W_n \xrightarrow{D} \chi^2(k - m)$ for $n \to \infty$.

Proof. The delta method (Theorem A.19) yields, using the abbreviations above:
$$\sqrt{n}\begin{pmatrix}\bar T_n - \hat\eta_n\\ \hat\eta_n - \eta\end{pmatrix} \overset{\mathrm{as}}{\sim} \begin{pmatrix}(I - H)Z\\ HZ\end{pmatrix}$$
where $Z \sim N_k(0, \kappa(\phi(\beta)))$, and the normal decomposition theorem (Theorem 2.24) thus yields that $\hat\eta_n$ is asymptotically independent of the difference $\bar T_n - \hat\eta_n$ and $W_n \xrightarrow{D} \chi^2(k - m)$. As $\hat\beta_n = g(\hat\eta_n)$, the result follows.


Compare also to Corollary 5.20 to see that W,, here is closely related to the quadratic 
score statistics. 

As in the case of a regular exponential family, we can readily extend The- 
orem 5.29 to a result on quadratic forms. 
Theorem 5.35. If $\mathcal{P} = \{P_{\phi(\beta)}, \beta \in B\}$ is a curved exponential family of dimension $m$, the log-likelihood ratio statistic $\Lambda_n$ satisfies

$$\Lambda_n = 2\left(\ell_n(\hat\beta_n) - \ell_n(\beta)\right) \overset{D}{\to} \chi^2(m)$$

with respect to $P_{\phi(\beta)}$, where $\chi^2(m)$ is the $\chi^2$-distribution with $m$ degrees of freedom. In addition it holds that the Wald statistic

$$W_n = n(\hat\beta_n - \beta)^\top i(\hat\beta_n)(\hat\beta_n - \beta)$$

has an asymptotic $\chi^2(m)$-distribution.


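Proof. The proof is almost identical to the proof of Theorem 5.17 and differs only in the handling of the information function. We apply Taylor's formula with remainder term in integral form (Theorem A.22) to the function $\ell_n(\beta)$ to obtain

$$\ell_n(\beta) = \ell_n(\hat\beta_n) + D\ell_n(\hat\beta_n)(\beta - \hat\beta_n) - \tfrac{1}{2}(\hat\beta_n - \beta)^\top K(\beta, \hat\beta_n)(\hat\beta_n - \beta)$$

where

$$K(\beta, \hat\beta_n) = -2\int_0^1 (1 - t)\, D^2\ell_n(\hat\beta_n + t(\beta - \hat\beta_n))\, dt = 2\int_0^1 (1 - t)\, I_n(\hat\beta_n + t(\beta - \hat\beta_n))\, dt,$$

and $D^2\ell_n(\beta) = -I_n(\beta)$, with $I_n$ the information function. Since $\hat\beta_n$ satisfies the score equation, we have $D\ell_n(\hat\beta_n) = 0$ and we thus get

$$\Lambda_n = 2\left(\ell_n(\hat\beta_n) - \ell_n(\beta)\right) = (\hat\beta_n - \beta)^\top K(\beta, \hat\beta_n)(\hat\beta_n - \beta).$$

In contrast to the proof of Theorem 5.17, the average information function may not be equal to the Fisher information, so we may have $\bar I_n \neq i$. However, from (B.4) we have the following explicit form of the information function:

$$\bar I_n(\beta) = I_n(\beta)/n = i(\beta) - \sum_u D^2\phi_u(\beta)\,(t_{un} - \tau_u(\phi(\beta))). \tag{5.22}$$

But $\hat\beta_n$ is consistent, so $\hat\beta_n \overset{P}{\to} \beta$, and as $t_n \overset{P}{\to} \tau(\phi(\beta))$, the second term in (5.22) converges in probability to zero, so

$$\operatorname{plim}_{n\to\infty} K(\beta, \hat\beta_n)/n = \operatorname{plim}_{n\to\infty} 2\int_0^1 (1 - t)\, i(\hat\beta_n + t(\beta - \hat\beta_n))\, dt = i(\beta).$$

Since $\hat\beta_n \overset{\mathrm{as}}{\sim} N_m(\beta, i(\beta)^{-1}/n)$, it holds that

$$n(\hat\beta_n - \beta)^\top i(\beta)(\hat\beta_n - \beta) \overset{D}{\to} \chi^2(m).$$

Corollary A.12 in combination with Lemma A.18 yields the conclusion.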
Figure 5.4. Visualization of maximum-likelihood estimation in the situation where a system of nested models is considered. The residuals $t_n - \hat\eta_n$ and $t_n - \tilde\eta_n$ are orthogonal to the tangent spaces $\mathcal{T}(\hat\beta_n)$ and $\tilde{\mathcal{T}}(\hat\alpha_n)$ with respect to the inner products determined by the Fisher information at $\hat\eta_n$ and $\tilde\eta_n$.


We next consider the case where the likelihood ratio statistic is comparing a given 
family to a curved subfamily; more precisely, a submodel of the form 


$$H_0 : \beta \in \gamma(A)$$


where $A \subseteq \mathbb{R}^d$ is open and $\gamma : A \to B$ is a smooth function satisfying the conditions ensuring $\mathcal{P}_0 = \{P_{\phi(\gamma(\alpha))}, \alpha \in A\}$ to be a curved exponential subfamily of a larger curved exponential family

$$\mathcal{P}_0 = \{P_{\phi(\gamma(\alpha))}, \alpha \in A\} \subseteq \mathcal{P} = \{P_{\phi(\beta)}, \beta \in B\} \subseteq \{P_\theta, \theta \in \Theta\}.$$

Note that the larger family $\mathcal{P}$ may itself be a regular exponential family, with the representation as a curved family only due to reparametrization.

We now let $\hat\eta_n = h(t_n) = \tau(\phi(\hat\beta_n))$ and $\tilde\eta_n = \tilde h(t_n) = \tau(\phi(\gamma(\hat\alpha_n)))$ denote the MLE of the mean value parameter under $\mathcal{P}$ and $\mathcal{P}_0$, respectively. The situation is described in Figure 5.4. Compare also with the linear normal case, as illustrated in Figure 2.2. Further, we let $H(\eta) = Dh(\eta)$ and $\tilde H(\eta) = D\tilde h(\eta)$ for $\eta = \tau(\phi(\gamma(\alpha)))$ be the corresponding hat-matrices as in Lemma 5.32. We have the following lemma:


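Lemma 5.36. The tangent space $\tilde{\mathcal{T}}(\alpha)$ at $\eta = \tau(\phi(\gamma(\alpha)))$ is a subspace of $\mathcal{T}(\beta)$ for $\beta = \gamma(\alpha)$, and we have

$$(I - H)(H - \tilde H) = (I - H)\tilde H = 0. \tag{5.23}$$

Proof. We get by composite differentiation that

$$\tilde\tau_i(\gamma(\alpha)) = \frac{\partial}{\partial\alpha_i}\tau(\phi(\gamma(\alpha))) = \sum_{j=1}^m \tau_j(\phi(\beta))\,\frac{\partial\gamma_j}{\partial\alpha_i}(\alpha)$$

and hence the vectors $\tilde\tau_i(\gamma(\alpha))$, $i = 1, \ldots, d$, spanning $\tilde{\mathcal{T}}(\alpha)$ are all linear combinations of the vectors $\tau_j(\phi(\beta))$, $j = 1, \ldots, m$, spanning $\mathcal{T}(\beta)$.

Since $H$ and $\tilde H$ represent orthogonal projections onto these spaces, we have $H^2 = H$, $\tilde H^2 = \tilde H$, as well as $H\tilde H = \tilde H H = \tilde H$; hence (5.23) follows.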


We then have the following asymptotic version of the main decomposition theorem 
Theorem 2.24: 


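Theorem 5.37. Let $\mathcal{P}_0 = \{P_{\phi(\gamma(\alpha))}, \alpha \in A\}$ be a $d$-dimensional curved subfamily of a curved exponential family $\mathcal{P} = \{P_{\phi(\beta)}, \beta \in B\}$ of dimension $m$ and order $k$; and let $X_1, \ldots, X_n, \ldots$ be independent and identically distributed according to some $P = P_{\phi(\gamma(\alpha))} \in \mathcal{P}_0$. Further, let $\eta = \tau(\phi(\gamma(\alpha)))$. Then

$$\sqrt{n}\begin{pmatrix} T_n - \hat\eta_n \\ \hat\eta_n - \tilde\eta_n \\ \tilde\eta_n - \eta \end{pmatrix} \overset{\mathrm{as}}{=} \sqrt{n}\begin{pmatrix} (I - H)T_n \\ (H - \tilde H)T_n \\ \tilde H T_n - \eta \end{pmatrix} \overset{D}{\to} \begin{pmatrix} (I - H)Z \\ (H - \tilde H)Z \\ \tilde H Z \end{pmatrix},$$

where $Z \sim N_k(0, \Sigma)$ for $\Sigma = \kappa(\phi(\gamma(\alpha)))$. It holds in particular that the quantities $T_n - \hat\eta_n$, $\hat\eta_n - \tilde\eta_n$, $\tilde\eta_n$ are asymptotically mutually independent and normally distributed on appropriate subspaces of $\mathbb{R}^k$ with concentrations $\Sigma^{-1}$.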


Proof. The statement about asymptotic equivalence follows directly from the delta 
method. The remaining part of the conclusion follows from the normal decomposi- 
tion theorem, using Lemma 5.36. 


We have a convenient corollary: 


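Corollary 5.38. Under the assumptions in Theorem 5.37, the Wald statistics are asymptotically equivalent and satisfy

$$n(\hat\eta_n - \tilde\eta_n)^\top \tilde\Sigma_n^{-1}(\hat\eta_n - \tilde\eta_n) \overset{\mathrm{as}}{=} n(\hat\eta_n - \tilde\eta_n)^\top \hat\Sigma_n^{-1}(\hat\eta_n - \tilde\eta_n) \overset{D}{\to} \chi^2(m - d),$$

where $\tilde\Sigma_n = \kappa(\phi(\gamma(\hat\alpha_n)))$ and $\hat\Sigma_n = \kappa(\phi(\hat\beta_n))$.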


Proof. This follows from Theorem 5.37 and the normal decomposition theorem 
combined with Lemma A.18, since the variance estimates are consistent. 
We are now ready to show the main result concerning the asymptotic distribution of the likelihood ratio. We define the maximized log-likelihood ratio statistic as

$$\Lambda_n = -2\log\frac{\sup_{\beta \in \gamma(A)} L(\beta; X_1, \ldots, X_n)}{\sup_{\beta \in B} L(\beta; X_1, \ldots, X_n)} = -2\log\frac{L(\gamma(\hat\alpha_n); X_1, \ldots, X_n)}{L(\hat\beta_n; X_1, \ldots, X_n)}$$

where $\hat\alpha_n$ is the MLE of $\alpha$ in $\mathcal{P}_0$ and $\hat\beta_n$ the MLE of $\beta$ in $\mathcal{P}$. We then have


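Theorem 5.39 (Wilks). Let $\mathcal{P}_0 = \{P_{\phi(\gamma(\alpha))}, \alpha \in A\}$ be a $d$-dimensional curved subfamily of a curved exponential family $\mathcal{P} = \{P_{\phi(\beta)}, \beta \in B\}$ of dimension $m$. Then the log-likelihood ratio statistic $\Lambda_n$ for the subfamily $\gamma(A)$ satisfies

$$\Lambda_n \overset{D}{\to} \chi^2(m - d)$$

with respect to any $P \in \mathcal{P}_0$.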


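Proof. As usual, the proof is based on Taylor expansion. We let $\hat\theta_n = \phi(\hat\beta_n)$ and $\tilde\theta_n = \phi(\gamma(\hat\alpha_n))$. We then have

$$\Lambda_n = 2\left(\ell_n(\hat\theta_n) - \ell_n(\tilde\theta_n)\right) = 2n\left(\psi(\tilde\theta_n) - \psi(\hat\theta_n)\right) + 2n(\hat\theta_n - \tilde\theta_n)^\top t_n.$$

Next we apply Taylor's formula with remainder in integral form (Theorem A.22) on $\psi$ to get

$$\psi(\tilde\theta_n) - \psi(\hat\theta_n) = \tau(\hat\theta_n)^\top(\tilde\theta_n - \hat\theta_n) + \tfrac{1}{2}(\tilde\theta_n - \hat\theta_n)^\top K(\tilde\theta_n, \hat\theta_n)(\tilde\theta_n - \hat\theta_n)$$

where

$$K(\tilde\theta_n, \hat\theta_n) = 2\int_0^1 (1 - t)\,\kappa(\hat\theta_n + t(\tilde\theta_n - \hat\theta_n))\, dt.$$

Inserting this into the expression above and using $\tau(\hat\theta_n) = \hat\eta_n$ yields

$$\Lambda_n = 2n(\hat\theta_n - \tilde\theta_n)^\top(t_n - \hat\eta_n) + n(\tilde\theta_n - \hat\theta_n)^\top K(\tilde\theta_n, \hat\theta_n)(\tilde\theta_n - \hat\theta_n).$$

Now since $\hat\theta_n = \tau^{-1}(\hat\eta_n)$ and $\tilde\theta_n = \tau^{-1}(\tilde\eta_n)$ are both asymptotically normally distributed and $D\tau^{-1} = \kappa^{-1}$, the delta method (Theorem A.19) gives

$$\sqrt{n}(\hat\theta_n - \tilde\theta_n) \overset{\mathrm{as}}{=} \sqrt{n}\,\kappa_0^{-1}(\hat\eta_n - \tilde\eta_n)$$

where $\kappa_0 = \kappa(\phi(\gamma(\alpha)))$; thus we can express everything in terms of $\eta$ as

$$\Lambda_n \overset{\mathrm{as}}{=} 2n(\hat\eta_n - \tilde\eta_n)^\top \kappa_0^{-1}(t_n - \hat\eta_n) + n(\tilde\eta_n - \hat\eta_n)^\top \kappa_0^{-1} K(\tilde\theta_n, \hat\theta_n)\,\kappa_0^{-1}(\tilde\eta_n - \hat\eta_n).$$

But we now get from Theorem 5.37, with $\Sigma = \kappa_0$, that

$$2n(\hat\eta_n - \tilde\eta_n)^\top \kappa_0^{-1}(t_n - \hat\eta_n) \overset{\mathrm{as}}{=} 2n\,T_n^\top (H - \tilde H)^\top \Sigma^{-1}(I - H)T_n = 2n\,T_n^\top \Sigma^{-1}(H - \tilde H)(I - H)T_n = 0$$

where we have used that $H$ and $\tilde H$ are self-adjoint and $(H - \tilde H)(I - H) = 0$ by Lemma 5.36, so we conclude

$$\Lambda_n \overset{\mathrm{as}}{=} n(\tilde\eta_n - \hat\eta_n)^\top \kappa_0^{-1} K(\tilde\theta_n, \hat\theta_n)\,\kappa_0^{-1}(\tilde\eta_n - \hat\eta_n).$$

Since both $\hat\theta_n$ and $\tilde\theta_n$ are consistent, we conclude that

$$\kappa_0^{-1} K(\tilde\theta_n, \hat\theta_n)\,\kappa_0^{-1} \overset{P}{\to} \kappa_0^{-1}\kappa_0\,\kappa_0^{-1} = \Sigma^{-1}.$$

Now Lemma A.18 and Corollary A.12 imply

$$\Lambda_n \overset{D}{\to} \chi^2(m - d),$$

as desired.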


We emphasize that although the limiting distribution of $\Lambda_n$ is $\chi^2(m - d)$ for all $P \in \mathcal{P}_0$, the quality of approximating the true distribution with the asymptotic distribution can be very different for different $P \in \mathcal{P}_0$.


5.5 More about asymptotics 


The classical asymptotic results for MLE were established by Cramér (1946) and 
Wald (1949). Cramér (1946) shows under regularity conditions — which include 
smoothness assumptions and majorized boundedness of third-order derivatives of 
the likelihood function — that there is a consistent solution to the score equation 
and that this solution is asymptotically normally distributed with the inverse Fisher 
information as its asymptotic covariance. For the sake of completeness, we shall here 
state Cramér’s theorem in full but without proof. 


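Theorem 5.40 (Cramér). Consider a sample $X_1, \ldots, X_n$ from a smooth and locally stable family with parameter space $\Theta \subseteq \mathbb{R}^k$. Assume further that for every $\theta \in \Theta$ there is a neighbourhood $U_\theta$ so the family of densities satisfy

$$\left|\frac{\partial^3}{\partial\eta_i\,\partial\eta_j\,\partial\eta_l}\log f_\eta(x)\right| \le H_\theta(x) \quad \text{for all } \eta \in U_\theta,$$

where $\sup_{\eta \in U_\theta} E_\eta\{H_\theta(X)\} \le M_\theta < \infty$. Then it holds for every neighbourhood $V_\theta$ of $\theta$ that

$$\lim_{n\to\infty} P_\theta\{\exists\,\hat\theta_n \in V_\theta : S(X_1, \ldots, X_n; \hat\theta_n) = 0\} = 1.$$

In other words, the probability that the score equation has a root $\hat\theta_n$ near $\theta$ tends to unity. Further, this root is asymptotically distributed as $\hat\theta_n \overset{\mathrm{as}}{\sim} N_k(\theta, (n\,i(\theta))^{-1})$.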
Note that there is no guarantee that the solution to the score equation maximizes the likelihood (see Example 5.27), and in specific cases the regularity conditions may either be hard to verify or be violated. However, it does follow from Lemma B.2 that regular exponential families satisfy the conditions in the theorem, and this is also easy to show for curved exponential families. So the conclusion in Cramér's theorem holds for curved exponential families.

Wald (1949) shows under a number of regularity conditions, involving uniform convergence of log-likelihood functions to their expectation and other (non-smooth) conditions on the parameter spaces, that the MLE is asymptotically well-defined and consistent. This is achieved without smoothness assumptions, but it also yields no smoothness results. Again this applies to curved exponential families, but we have established this specifically in Theorem 5.26.

The asymptotic theory for curved exponential families was originally developed 
by Andersen (1969) and Berk (1972). As the authors above, we have only considered 
the case where the sample consists of independent and identically distributed obser- 
vations; but the results turn out to be applicable well beyond that case although it is 
hard to give a precise general set of conditions that is easy to verify in a given case. 

Indeed, the important condition for the results to hold seems to be, apart from smoothness conditions, that the Fisher information $i_n(\beta)$ tends to infinity in a way that is reasonably uniform in $\beta$; this holds in the i.i.d. case because $i_n(\beta) = n\,i(\beta)$, where $i(\beta)$ is the Fisher information in a single observation. The reason for this is that we need a central limit theorem for the score statistic of the form

$$i_n(\beta)^{-1/2}\, S_n(X_n, \beta) \overset{D}{\to} N_m(0, I_m).$$

There are many variants of the central limit theorem that do not assume the score statistic to be a sum of i.i.d. random variables. In the non-i.i.d. case, the asymptotic results may often be true as well and may occasionally be confirmed by simulation.


Example 5.41. [Poisson asymptotics] As an example of this, let us consider a Poisson random variable $X$ with unknown mean $\lambda \in \mathbb{R}_+$. We have seen in Example 4.19 that the MLE of $\lambda$ is $\hat\lambda = X$ and in Example 1.26 we found that the Fisher information in the Poisson model was $i(\lambda) = \lambda^{-1}$. Now let us assume that $\lambda$ varies with $n$ so that $\lambda_n = n\mu$ and reparametrize with $\mu = \lambda_n/n$ so that $\hat\mu_n = X/n$ and the Fisher information about $\mu$ becomes $i_n(\mu) = n\mu^{-1}$, and we note that $i_n(\mu) \to \infty$ for $n \to \infty$.

But $X$ has the same distribution as $X = Y_1 + \cdots + Y_n$ where $Y_1, \ldots, Y_n$ are independent and identically Poisson distributed with parameter $\mu \in \mathbb{R}_+$. Thus we conclude from the asymptotic results that

$$\hat\mu_n = X/n \overset{\mathrm{as}}{\sim} N\left(\mu, \frac{\mu}{n}\right) \quad \text{for } n \to \infty$$

or, equivalently, that

$$\hat\lambda_n = X \overset{\mathrm{as}}{\sim} N(\lambda_n, \lambda_n) \quad \text{for } n \to \infty.$$

We may now phrase this as

$$\hat\lambda = X \overset{\mathrm{as}}{\sim} N(\lambda, \lambda) \quad \text{for } \lambda \to \infty$$

with the formal meaning that

$$\frac{X - \lambda}{\sqrt{\lambda}} \overset{D}{\to} N(0, 1) \quad \text{for } \lambda \to \infty,$$

thus establishing asymptotic normality of the MLE for large $\lambda$. Hence, if $\lambda$ is large, we may in effect treat the case as if we have a large number of observations.
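To get a feeling for how large $\lambda$ must be, one may compare the exact Poisson distribution function with its normal approximation. The following is a small sketch using only the Python standard library; the function name `max_cdf_gap` and the chosen values of $\lambda$ are illustrative assumptions, and no continuity correction is applied.

```python
import math
from statistics import NormalDist

def max_cdf_gap(lam):
    """Largest gap on the integers between the Poisson(lam) cdf and the N(lam, lam) cdf."""
    approx = NormalDist(mu=lam, sigma=math.sqrt(lam))
    upper = int(lam + 10 * math.sqrt(lam)) + 1
    cdf, gap = 0.0, 0.0
    for k in range(upper):
        # Poisson point probability computed in log space to avoid overflow
        cdf += math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
        gap = max(gap, abs(cdf - approx.cdf(k)))
    return gap

gaps = {lam: max_cdf_gap(lam) for lam in (1, 10, 100, 1000)}
for lam, gap in gaps.items():
    print(f"lambda = {lam:5d}: largest cdf difference = {gap:.4f}")
```

The gap shrinks as $\lambda$ grows, in line with the statement that a large mean plays the role of a large number of observations.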


For state of the art in asymptotic statistical theory, the reader is referred to van der 
Vaart (2012) who also develops the asymptotic theory for general M-estimators, in- 
cluding the MLE. 


5.6 Exercises 


Exercise 5.1. Let $X_1, \ldots, X_n, \ldots$ and $Y_1, \ldots, Y_n, \ldots$ be sequences of real-valued random variables so that $\operatorname{plim}_{n\to\infty} X_n = 0$ and $|Y_n| \le C$, where $C$ is a positive constant. Show that $\operatorname{plim}_{n\to\infty} X_n Y_n = 0$.


Exercise 5.2. Consider $\bar X_n = (Y_1 + \cdots + Y_n)/n$, where $Y_1, \ldots, Y_n$ are independent and identically Poisson distributed with parameter $\mu \in \mathbb{R}_+$ so that

$$\bar X_n \overset{\mathrm{as}}{\sim} N\left(\mu, \frac{\mu}{n}\right).$$

Use the delta method to show that

$$\bar X_n^{-1} \overset{\mathrm{as}}{\sim} N\left(\frac{1}{\mu}, \frac{1}{n\mu^3}\right)$$

and contrast this with the fact that $\bar X_n^{-1}$ does not have finite variance. Explain what is going on here. This is an example where the asymptotic variance is not a limit of variances, but the variance in the limiting distribution.


Exercise 5.3. Consider the estimator

$$S_n^2 = \frac{1}{n - 1}\sum_{i=1}^n (X_i - \bar X_n)^2$$

of the variance $\sigma^2$ in the simple standard normal model, so that $S_n^2$ is distributed as $S_n^2 \sim \sigma^2\chi^2(n - 1)/(n - 1)$.
a) Show that $S_n^2$ is asymptotically normally distributed with

$$S_n^2 \overset{\mathrm{as}}{\sim} N\left(\sigma^2, \frac{2\sigma^4}{n}\right).$$

b) Consider the transformation $Y_n = \log S_n^2$. Use the delta method to find the asymptotic distribution of $Y_n$.
Exercise 5.4. Consider the simple Bernoulli model where $X_1, \ldots, X_n$ are independent and identically distributed with

$$P(X_i = 1) = 1 - P(X_i = 0) = \mu$$

with $\mu \in (0, 1)$ being unknown.
a) Show that the MLE of $\mu$ is $\hat\mu_n = \bar X_n$ and that this is asymptotically normally distributed with

$$\hat\mu_n \overset{\mathrm{as}}{\sim} N\left(\mu, \frac{\mu(1 - \mu)}{n}\right).$$

b) Consider the transformation $Y_n = \sin^{-1}(\sqrt{\bar X_n})$. Use the delta method to find the asymptotic distribution of $Y_n$.


Exercise 5.5. Consider a sample $Z_1, \ldots, Z_n$ of independent and identically distributed random variables, where $Z_i = X_i - Y_i$ with $X_i$ and $Y_i$ independent and exponentially distributed with mean $\theta$.

a) Show that $E_\theta(Z_i^4) = 24\theta^4$;
b) Find the asymptotic distribution of the moment estimator $\tilde\theta_n$ of $\theta$ based on the statistic $t(z) = z^2$;
c) Compare the asymptotic distribution of $\tilde\theta_n$ to that of

$$\hat\theta_n = \frac{1}{2n}\sum_{i=1}^n (X_i + Y_i)$$

which would be the MLE of $\theta$ had $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_n$ also been observed.


Exercise 5.6. Consider the family of Pareto distributions with densities

$$f_\alpha(x) = \alpha x^{-\alpha - 1}\,1_{(1,\infty)}(x)$$

with respect to standard Lebesgue measure on $\mathbb{R}$ and assume $\alpha > 2$ but unknown. Consider a sample $X_1, \ldots, X_n$ from this distribution.
a) Determine a moment estimator $\tilde\alpha_n$ for $\alpha$ based on the statistic $t(x) = x$ and the sample $X_1, \ldots, X_n$;
b) Find the asymptotic distribution of this moment estimator;
c) Show that the MLE $\hat\alpha_n$ of $\alpha$ based on the sample $X_1, \ldots, X_n$ is given as

$$\hat\alpha_n = \frac{n}{\sum_{i=1}^n \log X_i};$$

d) Find the asymptotic distribution of $\hat\alpha_n$;
e) Compare the estimators $\tilde\alpha_n$ and $\hat\alpha_n$.

Exercise 5.7. Consider the family of distributions with densities

$$f_\theta(x) = \theta x^{\theta - 1}\,1_{(0,1)}(x), \quad \theta \in \Theta = \mathbb{R}_+,$$

with respect to standard Lebesgue measure on $\mathbb{R}$ and consider a sample $X_1, \ldots, X_n$ from this distribution.
a) Determine a moment estimator $\tilde\theta_n$ for $\theta$ based on the statistic $t(x) = x$;
b) Find the asymptotic distribution of this moment estimator;
c) Show that the family is a regular one-dimensional exponential family and identify the canonical parameter, canonical statistic, and cumulant function;
d) Show that the MLE of $\theta$ based on a sample $X_1, \ldots, X_n$ is given as

$$\hat\theta_n = \frac{-n}{\sum_{i=1}^n \log X_i};$$

e) Find the asymptotic distribution of $\hat\theta_n$;
f) Compare the estimators $\tilde\theta_n$ and $\hat\theta_n$.


Exercise 5.8. Consider the family of gamma distributions with identical parameters for shape and scale, i.e. with densities

$$f_\beta(x) = \frac{x^{\beta - 1}e^{-x/\beta}}{\Gamma(\beta)\beta^\beta}, \quad \beta \in \mathbb{R}_+,$$

with respect to standard Lebesgue measure on $\mathbb{R}_+$, and let $X_1, \ldots, X_n$ be a sample from this distribution.

a) Determine a moment estimator $\tilde\beta_n$ for $\beta$ based on the statistic $t(x) = x$ and the sample $X_1, \ldots, X_n$ and find its asymptotic distribution.

b) Show that a moment estimator $\check\beta_n$ for $\beta$ based on the statistic $t(x) = x^2$ and the sample $X_1, \ldots, X_n$ is well-defined and find its asymptotic distribution.

c) Find the asymptotic distribution of the MLE $\hat\beta_n$ of $\beta$ based on the sample $X_1, \ldots, X_n$.

d) Compare the asymptotic distributions of the estimators $\tilde\beta_n$, $\check\beta_n$, and $\hat\beta_n$.


Exercise 5.9. Let $X$ and $Y$ be independent random variables with $X$ Poisson distributed with mean $\lambda$ and $Y$ exponentially distributed with rate $\lambda$, where $\lambda > 0$, and let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample from this distribution.

a) Determine a moment estimator $\tilde\lambda_n$ for $\lambda$ based on the statistic $t(x, y) = x - y$ and the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ and find its asymptotic distribution.

b) Find the asymptotic distribution of the MLE $\hat\lambda_n$ of $\lambda$.
c) Compare the asymptotic distributions of the estimators $\tilde\lambda_n$ and $\hat\lambda_n$.


Exercise 5.10. Let $X$ and $Y$ be independent and exponentially distributed random variables with $E(X) = \beta$ and $E(Y) = 1/\beta$ where $\beta > 0$ as in Exercise 3.9 and Exercise 4.10, and let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample from this distribution.

a) Determine a moment estimator $\tilde\beta_n$ for $\beta$ based on the statistic $t(x, y) = x - y$ and the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$ and find its asymptotic distribution;

b) Find the asymptotic distribution of the MLE $\hat\beta_n$ of $\beta$ based on the sample $(X_1, Y_1), \ldots, (X_n, Y_n)$;

c) Compare the asymptotic distributions of the estimators $\tilde\beta_n$ and $\hat\beta_n$.
Exercise 5.11. Consider the situation in Exercise 3.6, where $(X, Y)$ are random variables taking values in $\mathbb{R}_+ \times \mathbb{N}_0$ with density

1 y 
fo (@y) = so (B2)" .-62 oS 0;, y= 0,134: (5.24) 
y! 
with respect to $\nu \times m$, where $\nu$ is the standard Lebesgue measure on $\mathbb{R}_+$ and $m$ is counting measure on $\mathbb{N}_0$. Consider now $n$ independent and identically distributed observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, where $(X_i, Y_i)$ has density of the form (5.24) and $(\lambda, \beta) \in \mathbb{R}_+^2$ are unknown.

a) Determine the MLE for $(\lambda, \beta)$ and identify when it is well-defined;
b) Find the asymptotic distribution of the MLE $(\hat\lambda_n, \hat\beta_n)$;

c) Determine the moment estimator $(\tilde\lambda_n, \tilde\beta_n)$ for $(\lambda, \beta)$ based on the function $t(X, Y) = (X, XY)$;

d) Find the asymptotic distribution of the moment estimator $\tilde\beta_n$ and compare it to the asymptotic distribution of $\hat\beta_n$.
Chapter 6


Set Estimation 


6.1 Basic issues and definition 


Estimation, as described in Chapter 4, yields methods for 'guessing' the unknown parameter $\theta$ involved in generating the observation $x$. But it fails to give an indication of the precision of that estimate. One way of indicating precision is by giving a set estimate rather than a point estimate.

More precisely, if $\Theta \subseteq \mathbb{R}^k$, we let $\mathcal{C}$ be a collection of subsets of $\mathbb{R}^k$. For example, $\mathcal{C}$ could be the set of spheres in $\mathbb{R}^k$. For $k = 1$ we would typically let $\mathcal{C}$ be the set of open intervals, indicated by their endpoints.

So consider as usual a parametrized statistical model with representation space $(\mathcal{X}, \mathbb{E})$ and associated family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$. We define

Definition 6.1. A set estimator is a map $C : \mathcal{X} \to \mathcal{C}$ with the property that the induced maps

$$x \mapsto 1_{C(x)}(\theta)$$

are measurable for all $\theta \in \Theta$. Its coverage $c_\theta$ is given as

$$c_\theta = E_\theta\{1_{C(X)}(\theta)\} = P_\theta\{C(X) \ni \theta\}.$$

We say that $C(X)$ is a $1 - \alpha$ confidence set or confidence region if the coverage is at least $1 - \alpha$, i.e. if $c_\theta \ge 1 - \alpha$ for all $\theta \in \Theta$.


Note that it is the set $C(X)$ that is random and not $\theta$. The 'confidence' does not say anything about how likely $\theta$ is once $x$ is observed, but if $\alpha = 0.05$, say, it says that the method for producing our confidence set will include the true value at least 95% of the time. Put differently, at most about 5% of our confidence sets will be wrong, i.e. not include the true value. The distinction is subtle and it is quite common to get this wrong when referring to statistical investigations, for example in the popular press.
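The frequency interpretation of coverage can be illustrated by simulation. The sketch below is only an illustration under assumptions not made in the text: it uses the familiar interval $\bar x \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ for a normal mean with known $\sigma$, and the function names, sample size, and seed are arbitrary choices.

```python
import random
from statistics import NormalDist, fmean

def z_interval(sample, sigma, alpha):
    """Two-sided 1 - alpha interval for a normal mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = fmean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return centre - half_width, centre + half_width

def estimated_coverage(true_mean, sigma=1.0, n=10, alpha=0.05,
                       repeats=20000, seed=1):
    """Fraction of simulated data sets whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma, alpha)
        hits += lower < true_mean < upper
    return hits / repeats

coverage = estimated_coverage(true_mean=3.0)
print(f"estimated coverage: {coverage:.3f}")
```

In each repetition the parameter stays fixed while the interval changes with the data, which is exactly the sense in which the set, and not $\theta$, is random.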


6.2 Exact confidence regions by pivots 


Consider a parametrized statistical model $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ with representation space $(\mathcal{X}, \mathbb{E})$ and let $(\mathcal{Y}, \mathbb{F})$ be a measurable space. We shall consider functions of the form $R : \mathcal{X} \times \Theta \to \mathcal{Y}$ satisfying

$$R(\cdot, \theta) : \mathcal{X} \to \mathcal{Y} \text{ is measurable for all } \theta \in \Theta.$$

Definition 6.2. A function $R : \mathcal{X} \times \Theta \to \mathcal{Y}$ as above is a pivot if it holds for all $B \in \mathbb{F}$ that $P_\theta\{R(X, \theta) \in B\}$ does not depend on $\theta$.


Pivots are plentiful, and the pivot below is in some sense universal:


Example 6.3. [The universal pivot] If $Y = t(X)$ is a real-valued random variable with distribution function $F_\theta$, i.e.

$$P_\theta\{Y \le y\} = F_\theta(y),$$

that is continuous on the support of $t(P_\theta)$, the function

$$R_0(X, \theta) = F_\theta(t(X))$$

is indeed such a pivot, since for $u \in (0, 1)$

$$P_\theta\{R_0(X, \theta) \le u\} = P_\theta\{Y \le \inf\{y \mid F_\theta(y) > u\}\} = u$$

so $F_\theta(Y)$ is uniformly distributed on the unit interval for all $\theta \in \Theta$.


The universal pivot in Example 6.3 can in principle be used to construct a set estimate as follows. Choose any $B \subseteq (0, 1)$ with Lebesgue measure $\lambda(B) = 1 - \alpha$. Next, let

$$C_B(x) = \{\theta \mid F_\theta(t(x)) \in B\}.$$

Then, clearly, $C_B(X)$ is a set estimator with coverage $1 - \alpha$. For some $B$ this might be sensible, but in general there are too many different possibilities for $B$, and it might be impossible or difficult to compute $C_B(x)$ for a given $x$ and $B$.
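As a small illustration of this construction, the sketch below takes a single observation $y$ from an exponential distribution with mean $\theta$, so that $F_\theta(y) = 1 - e^{-y/\theta}$, and chooses $B = (\alpha/2, 1 - \alpha/2)$. The model, the choice of $B$, and all function names are assumptions made for the example; in this case $C_B(y)$ can be solved explicitly.

```python
import math
import random

def exponential_cdf(y, theta):
    """Distribution function of the exponential distribution with mean theta."""
    return 1.0 - math.exp(-y / theta)

def universal_pivot_set(y, alpha=0.05):
    """C_B(y) = {theta : F_theta(y) in B} for B = (alpha/2, 1 - alpha/2)."""
    lower = y / -math.log(alpha / 2)        # here F_theta(y) = 1 - alpha/2
    upper = y / -math.log(1 - alpha / 2)    # here F_theta(y) = alpha/2
    return lower, upper

def coverage(theta, repeats=20000, seed=2):
    """Simulated coverage of C_B when the true mean is theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        y = rng.expovariate(1.0 / theta)    # expovariate takes the rate 1/theta
        lower, upper = universal_pivot_set(y)
        hits += lower < theta < upper
    return hits / repeats

coverages = [coverage(theta) for theta in (0.1, 1.0, 25.0)]
print(coverages)
```

The simulated coverage is close to 0.95 whatever value of $\theta$ is used, which is what it means for $F_\theta(t(X))$ to be a pivot.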

Example 6.4. [Universal pivot for the normal distribution] Let $X = x$ be an observation from a normal distribution on $\mathbb{R}$ with mean $\xi \in \mathbb{R}$ and variance $\sigma^2 > 0$. The distribution function is thus

$$R(x; \xi, \sigma^2) = F_{\xi,\sigma^2}(x) = \Phi\left(\frac{x - \xi}{\sigma}\right),$$

where $\Phi$ is the distribution function for $N(0, 1)$. To ensure coverage $1 - \alpha$, we may for example choose

$$B = B_\delta = (\delta\alpha,\, 1 - (1 - \delta)\alpha)$$

for any $\delta \in [0, 1]$, since then

$$\lambda(B_\delta) = 1 - (1 - \delta)\alpha - \delta\alpha = 1 - \alpha.$$

This now leads to the set estimate $C_\delta(x)$ determined as

$$(\xi, \sigma^2) \in C_\delta(x) \iff \delta\alpha < \Phi\left(\frac{x - \xi}{\sigma}\right) < 1 - (1 - \delta)\alpha.$$

Taking inverses and moving things around, we see that this is equivalent to

$$(\xi, \sigma^2) \in C_\delta(x) \iff x - \sigma z_{1-(1-\delta)\alpha} < \xi < x - \sigma z_{\delta\alpha},$$

where $z_\gamma = \Phi^{-1}(\gamma)$ is the normal quantile function. Now exploiting symmetry in the normal distribution and hence in the quantile function we note that $z_{1-\gamma} = -z_\gamma$ and thus obtain the set

$$C_\delta(x) = \{(\xi, \sigma^2) : x - \sigma z_{1-(1-\delta)\alpha} < \xi < x + \sigma z_{1-\delta\alpha}\}.$$

For a fixed value of $\sigma^2$, this is a confidence interval, but generally, it is a rather large region in $\mathbb{R} \times \mathbb{R}_+$. If $\delta = 0$ or $\delta = 1$, we say the confidence interval is one-sided. If $\delta = 1/2$, we say that the interval is two-sided and symmetric.
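For a fixed, known $\sigma$ the interval above is easy to compute. The following sketch uses `statistics.NormalDist` from the Python standard library; the function name and the numerical inputs are hypothetical, and the one-sided cases $\delta \in \{0, 1\}$ are excluded because they involve infinite quantiles.

```python
from statistics import NormalDist

def normal_pivot_interval(x, sigma, alpha=0.05, delta=0.5):
    """Interval for xi from one observation x ~ N(xi, sigma^2), sigma known.

    Uses B_delta = (delta * alpha, 1 - (1 - delta) * alpha) with 0 < delta < 1.
    """
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf
    lower = x - sigma * z(1 - (1 - delta) * alpha)
    upper = x + sigma * z(1 - delta * alpha)
    return lower, upper

symmetric = normal_pivot_interval(2.0, 1.0)           # delta = 1/2
skewed = normal_pivot_interval(2.0, 1.0, delta=0.2)   # unequal tail areas
print(symmetric, skewed)
```

Both intervals have coverage $1 - \alpha$; the choice of $\delta$ only decides how the error probability $\alpha$ is split between the two tails.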


Example 6.5. [Student's T as pivot] Example 6.4 gives a formally correct region for the pair of parameters $(\xi, \sigma^2)$ jointly, but this is huge and not really very helpful. Typically, we are interested in $\xi$ only, whereas $\sigma^2$ is a nuisance parameter. So consider independent and identically distributed observations $X_1, \ldots, X_n$, where $X_i \sim N(\xi, \sigma^2)$ with $\xi \in \mathbb{R}$ and $\sigma^2 > 0$ both unknown. We may now use these repeated observations to obtain information about $\sigma^2$ and form the quantity

$$T_n = T_n(\xi, X_1, \ldots, X_n) = \frac{\sqrt{n}(\bar X_n - \xi)}{S_n},$$

where as usual $\bar X_n = (X_1 + \cdots + X_n)/n$ and $S_n^2 = \sum_i (X_i - \bar X_n)^2/(n - 1)$. For all values of $(\xi, \sigma^2)$, $T_n$ follows a Student's t-distribution with $n - 1$ degrees of freedom and hence $T_n$ is a pivot. We thus get a $1 - \alpha$ confidence region for $\xi$ as

$$C_T(X) = \{\xi \in \mathbb{R} \mid t^{n-1}_{\alpha/2} < T_n(\xi, X_1, \ldots, X_n) < t^{n-1}_{1-\alpha/2}\},$$

where $t^{n-1}_\delta$ is the $\delta$-quantile in the t-distribution with $n - 1$ degrees of freedom. Equivalently we may write

$$C_T(X) = \left(\bar X_n - t^{n-1}_{1-\alpha/2}\frac{S_n}{\sqrt{n}},\ \bar X_n + t^{n-1}_{1-\alpha/2}\frac{S_n}{\sqrt{n}}\right).$$

Note also that we in fact have $T_n^2 = W_n$, where $W_n$ is the Wald statistic in (5.7). In this case, $W_n$ is distributed as $F(1, n - 1)$, which tends to $\chi^2(1)$ for $n \to \infty$.


Example 6.6. [Linear parameter functions in the normal model] An analogous but slightly more sophisticated example is the following. Let us again consider the linear normal model, so that $X \sim N_V(\xi, \sigma^2 I_V)$, where $(V, \langle\cdot,\cdot\rangle)$ is a $d$-dimensional Euclidean space, $\xi \in L$ with $L \subseteq V$ an $m$-dimensional linear subspace of $V$, and $\sigma^2 > 0$, with both $\xi$ and $\sigma^2$ unknown.

Suppose our parameter of interest is a real-valued linear function of $\xi$, i.e. we have $\eta = \langle u, \xi\rangle$ for some $u \in V$. From Example 4.22, we know that the MLE of $\xi$ is $\hat\xi = \Pi_L(X)$ and hence

$$\hat\eta = \langle u, \hat\xi\rangle = \langle u, \Pi_L(X)\rangle = \langle \Pi_L(u), X\rangle$$

and this is distributed as

$$\hat\eta \sim N(\langle u, \xi\rangle, \sigma^2\langle u, \Pi_L(u)\rangle) = N(\eta, \sigma^2\|\Pi_L(u)\|^2).$$

As in the previous example we have information about $\sigma^2$ from $X - \Pi_L X$, which is independent of $\hat\xi$ and yields an unbiased estimator of $\sigma^2$ as

$$\tilde\sigma^2 = \frac{\|X - \Pi_L X\|^2}{d - m}.$$

We may now construct a pivot as

$$R(X, \eta) = \frac{\hat\eta - \eta}{\sqrt{\tilde\sigma^2\|\Pi_L(u)\|^2}} = \sqrt{d - m}\,\frac{\langle u, \Pi_L(X) - \xi\rangle}{\|\Pi_L(u)\|\,\|X - \Pi_L X\|}$$

which follows Student's t-distribution with $d - m$ degrees of freedom. This leads to the following confidence interval for $\eta$:

$$C(X) = \left(\langle u, \Pi_L(X)\rangle - t^{d-m}_{1-\alpha/2}\,\mathrm{SE},\ \langle u, \Pi_L(X)\rangle + t^{d-m}_{1-\alpha/2}\,\mathrm{SE}\right)$$

where

$$\mathrm{SE} = \frac{\|\Pi_L(u)\|\,\|X - \Pi_L X\|}{\sqrt{d - m}}$$

is the estimated standard error of the estimate $\hat\eta$. Note the complete analogy to the confidence region found in Example 6.5, where we had $d = n$, $m = 1$, $u = (1, \ldots, 1)/n$, $\langle u, \Pi_L(X)\rangle = \bar X_n$, and $\|\Pi_L(u)\|^2 = 1/n$.

Example 6.7. [Pivots for the exponential distribution] Let us consider a sample $X_1, \ldots, X_n$ of independent and identically exponentially distributed random variables with unknown expectation $\theta \in \Theta = \mathbb{R}_+$. Since the exponential distribution is also a gamma distribution with shape parameter $\alpha = 1$, we deduce that $\bar X_n \sim \Gamma(n, \theta/n)$ and hence

$$\frac{\bar X_n}{\theta}$$

is a pivot. So if we let $g^n_{0.025}$ and $g^n_{0.975}$ denote the corresponding quantiles in the $\Gamma(n, 1/n)$ distribution, it holds for all $\theta \in \Theta$ that

$$P_\theta\left\{g^n_{0.025} < \frac{\bar X_n}{\theta} < g^n_{0.975}\right\} = 0.95,$$

or, equivalently, by taking reciprocals and multiplying with $\bar X_n$,

$$P_\theta\left\{\frac{\bar X_n}{g^n_{0.975}} < \theta < \frac{\bar X_n}{g^n_{0.025}}\right\} = 0.95$$

and thus

$$C_1 = C_1(X_1, \ldots, X_n) = \left(\frac{\bar X_n}{g^n_{0.975}}, \frac{\bar X_n}{g^n_{0.025}}\right)$$

is a 95% confidence interval for $\theta$. For illustration throughout this chapter, we consider a sample of $n = 8$ observations with values

0.581 0.621 3.739 4.354 0.409 1.843 1.705 0.312

here leading to the confidence interval $C_1 = (0.94, 3.93)$. A universal pivot would exploit that $Y = X_1 + \cdots + X_n \sim \Gamma(n, \theta)$ so

$$R_0(X_1, \ldots, X_n; \theta) = G_{(n,\theta)}(Y) = G_{(n,1/n)}(Y/(n\theta)) = G_{(n,1/n)}(\bar X_n/\theta)$$

where $G_{(\alpha,\beta)}$ is the Gamma distribution function. Then for $B = (0.025, 0.975)$, the confidence interval $C_B$ based on $R_0$ would be identical to the interval $C_1$ found above.

In general there is a strong element of arbitrariness in constructing set estimators, both in terms of the choice of pivot $R$ and the choice of the set $B$ used in the construction; still the computation of $C_B(x)$ might be challenging. In the following we shall give some general methods for constructing set estimators.


6.3 Likelihood-based regions 


It seems natural to attempt to base a set estimate on the likelihood function or, equivalently, on the log-likelihood function. More precisely, a likelihood-based region has the form

$$C^a(x) = \{\theta : \Lambda(x, \theta) < a\} \tag{6.2}$$

for some $a > 0$, where $\Lambda = \Lambda(x, \theta)$ is the log-likelihood ratio

$$\Lambda(x, \theta) = 2\left(\ell_x(\hat\theta) - \ell_x(\theta)\right) = -2\log\frac{L_x(\theta)}{L_x(\hat\theta)} \tag{6.3}$$

with $\hat\theta = \hat\theta(x)$ denoting an MLE of the unknown parameter $\theta$. Note that we use the term log-likelihood ratio even though we have a factor $-2$ in front, and here we explicitly need to consider the dependence of this statistic on data and parameters.

This type of set appears particularly natural since it necessarily contains the MLE, i.e. it holds that $\hat\theta \in C^a(x)$ for all $a > 0$. This is fine if the construction works. However, the problem here is that the coverage

$$c_\theta(a) = P_\theta\{C^a(X) \ni \theta\}$$

in general depends on the unknown parameter $\theta$, and hence $a$ cannot be determined to achieve a specific level of confidence.

However, in some examples we are lucky and the log-likelihood ratio is actually a pivot. In such cases the coverage can be determined by Monte Carlo methods even though the distribution of $\Lambda(X, \theta)$ cannot easily be expressed in analytic form.


Example 6.8. We consider again a sample $X = (X_1, \ldots, X_n)$ from an exponential distribution with unknown mean $\theta \in \Theta = \mathbb{R}_+$. We have previously, in Example 4.20, derived the log-likelihood function and MLE to be

$$\ell_x(\theta) = -n\log\theta - \frac{\sum_i x_i}{\theta}, \qquad \hat\theta_n = \bar X_n = \frac{\sum_i X_i}{n},$$

and thus the maximized log-likelihood function is

$$\ell_x(\hat\theta) = -n\log\bar X_n - n.$$

Figure 6.1. The figure displays a histogram of 100000 simulated values of $\Lambda(X, \theta)$ in the exponential distribution for $n = 8$. Since $\Lambda_n$ is a pivot, it is sufficient to simulate for $\theta = 1$. The 95% quantile in the empirical distribution is $\lambda_{0.95} = 3.91$ and is indicated by a vertical dotted line.

Hence, the log-likelihood ratio statistic becomes

$$\Lambda_n(X, \theta) = 2n\left(\frac{\bar X_n}{\theta} - \log\frac{\bar X_n}{\theta} - 1\right).$$

Since $\bar X_n/\theta$ is a pivot, the same holds for $\Lambda_n(X, \theta)$; indeed the likelihood ratio statistic has the same distribution as $2n(Y - \log Y - 1)$, where $Y \sim \Gamma(n, 1/n)$. We may now determine, say, the 95% quantile $\lambda_{0.95}$ in the distribution of $\Lambda_n$ by simulation. Fig. 6.1 shows the result of such a simulation for the case of $n = 8$, leading to the quantile $\lambda_{0.95} = 3.91$. The 95% confidence interval is thus

$$C_2(X) = \{\theta \in \Theta \mid \Lambda_n(X, \theta) < 3.91\}$$

which yet again demands a numerical solution. Fig. 6.2 shows the numerical calculation for data in Example 6.7 with $n = 8$ observations, leading to the confidence interval $C_2 = (0.91, 3.74)$.
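Both steps of this example, the Monte Carlo determination of the quantile and the numerical solution for the endpoints, can be sketched as follows. The code uses only the Python standard library; the seed, the bracketing factors in `lr_interval`, and the function names are arbitrary choices for illustration, so a simulated quantile will only agree with 3.91 up to Monte Carlo error.

```python
import math
import random

def lr_statistic(mean, theta, n):
    """Log-likelihood ratio 2n(y - log y - 1) with y = mean / theta."""
    y = mean / theta
    return 2 * n * (y - math.log(y) - 1)

def simulated_quantile(n, level=0.95, repeats=100000, seed=3):
    """Empirical quantile of the statistic, simulated at theta = 1."""
    rng = random.Random(seed)
    values = sorted(lr_statistic(rng.gammavariate(n, 1.0 / n), 1.0, n)
                    for _ in range(repeats))
    return values[int(level * repeats)]

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == (f_lo > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def lr_interval(mean, n, cutoff):
    """Endpoints of {theta : lr_statistic(mean, theta, n) < cutoff}."""
    g = lambda theta: lr_statistic(mean, theta, n) - cutoff
    return bisect(g, mean * 1e-3, mean), bisect(g, mean, mean * 1e3)

data = [0.581, 0.621, 3.739, 4.354, 0.409, 1.843, 1.705, 0.312]
n, mean = len(data), sum(data) / len(data)
quantile = simulated_quantile(n)
c2 = lr_interval(mean, n, cutoff=3.91)   # cutoff as reported in the text
print(round(quantile, 2), tuple(round(v, 2) for v in c2))
```

The bisection works because the statistic is zero at $\theta = \bar X_n$ and increases as $\theta$ moves away from the MLE in either direction, so each side of the MLE contains exactly one endpoint.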


6.4 Confidence regions by asymptotic pivots 


Although confidence regions constructed via pivots have some elegance, they are not widely available and it is mostly necessary to rely on approximate methods. Likelihood-based regions may be difficult to handle exactly, but when the number $n$ of observations is large we note that it follows from Theorem 5.17 and Theorem 5.35 that the likelihood ratio statistic $\Lambda_n$ is an asymptotic pivot. The same is true for the Wald statistics and the quadratic score statistic, since all of these have an asymptotic $\chi^2$-distribution, i.e. an asymptotic distribution that does not depend on the unknown parameter $\theta$.

Figure 6.2. Determination of likelihood ratio confidence interval for the unknown mean. The horizontal line in the diagram is placed at the 95% quantile $\lambda_{0.95} = 3.91$ in the empirical distribution as determined in Fig. 6.1. The endpoints of the interval are given by the intersection of this line with the curve of log-likelihood ratio values.


6.4.1 Asymptotic likelihood-based regions 


Consider a curved or regular exponential family so that the log-likelihood ratio statistic $\Lambda_n(\theta) = \Lambda(X_1, \ldots, X_n, \theta)$ approximately follows a $\chi^2(m)$ distribution, where $m$ is the dimension of the family. Further, let $\chi^2_{1-\alpha}(m)$ denote the $1 - \alpha$ quantile in this distribution. The asymptotic likelihood region is given as

$$C_{1-\alpha}(x) = C^{\chi^2_{1-\alpha}(m)}(x) = \{\theta : \Lambda_n(\theta) < \chi^2_{1-\alpha}(m)\}. \tag{6.4}$$

Even though we may rely on the asymptotic results for determining the coverage, we would normally still have to solve the equation by numerical means, which can be quite involved in many dimensions.


Example 6.9. Let us again consider the exponential distribution as in Example 6.8. We calculated the 95% quantile by simulation to be $\lambda_{0.95} = 3.91$, whereas the asymptotic 95% quantile for a $\chi^2(1)$ distribution is $\chi^2_{0.95}(1) = 3.84$. The asymptotic confidence interval for the data example is now determined in the same way as described in Fig. 6.2 to be $C_3 = (0.91, 3.68)$.
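The next example shows that issues may be somewhat more involved when the parameter space is more than one-dimensional.


Example 6.10. Consider a sample $X = (X_1, \ldots, X_n)$ from a normal distribution $N(\xi, \sigma^2)$ with both $\xi$ and $\sigma^2$ being unknown. We recall that the MLE is

$$\hat\xi_n = \bar X_n, \qquad \hat\sigma_n^2 = S_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2,$$

and get for the log-likelihood ratio statistic

$$\begin{aligned}
\Lambda_n &= n\log\sigma^2 + \frac{\sum_i (X_i - \xi)^2}{\sigma^2} - n\log S_n^2 - \frac{\sum_i (X_i - \bar X_n)^2}{S_n^2} \\
&= n\log\frac{\sigma^2}{S_n^2} + \frac{n(\bar X_n - \xi)^2}{\sigma^2} + n\left(\frac{S_n^2}{\sigma^2} - 1\right) \\
&= \frac{n(\bar X_n - \xi)^2}{\sigma^2} - n(\log(1 + A_n) - A_n) \\
&= \frac{n(\bar X_n - \xi)^2}{\sigma^2} + \frac{nA_n^2}{2} + \epsilon(A_n)\,|\sqrt{n}A_n|^2 \\
&\overset{\mathrm{as}}{=} \frac{n(\bar X_n - \xi)^2}{\sigma^2} + \frac{nA_n^2}{2}
\end{aligned}$$

where we have let $A_n = S_n^2/\sigma^2 - 1$, used Taylor's formula with Peano's remainder term (Theorem A.24), and the fact that $\sqrt{n}A_n \overset{D}{\to} N(0, 2)$ to obtain the last two equations.

Although we shall not use this directly, the calculation illustrates how the approximate $\chi^2(2)$-distribution appears as the sum of the term $n(\bar X_n - \xi)^2/\sigma^2$, which has an exact $\chi^2(1)$-distribution, and the independent term

$$-n(\log(1 + A_n) - A_n)$$

which has an approximate $\chi^2(1)$-distribution. Even in this simple case, the asymptotic confidence region

$$C^{\chi^2_{1-\alpha}(2)}(X) = \left\{(\xi, \sigma^2) : n\log\frac{\sigma^2}{S_n^2} + \frac{n(\bar X_n - \xi)^2}{\sigma^2} + n\left(\frac{S_n^2}{\sigma^2} - 1\right) < \chi^2_{1-\alpha}(2)\right\}$$

can only be calculated numerically.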
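While the boundary of the region has no closed form, deciding whether a given pair $(\xi, \sigma^2)$ belongs to it is straightforward. The sketch below evaluates the log-likelihood ratio with the variance estimate taken as the MLE (divisor $n$) and uses that the $1 - \alpha$ quantile of the $\chi^2(2)$-distribution equals $-2\log\alpha$. The sample values and function names are hypothetical and serve only as illustration.

```python
import math

def normal_lr_statistic(sample, xi, sigma2):
    """Log-likelihood ratio for (xi, sigma2) in the normal model."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / n   # MLE of the variance
    return (n * math.log(sigma2 / s2)
            + n * (mean - xi) ** 2 / sigma2
            + n * (s2 / sigma2 - 1))

def in_region(sample, xi, sigma2, alpha=0.05):
    """True if (xi, sigma2) lies in the asymptotic 1 - alpha likelihood region."""
    cutoff = -2 * math.log(alpha)   # chi-square(2) is exponential with mean 2
    return normal_lr_statistic(sample, xi, sigma2) < cutoff

sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]   # hypothetical observations
n = len(sample)
xi_hat = sum(sample) / n
s2_hat = sum((x - xi_hat) ** 2 for x in sample) / n
print(in_region(sample, xi_hat, s2_hat), in_region(sample, 8.0, 0.36))
```

A plot of the region could be produced by evaluating `in_region` on a grid of $(\xi, \sigma^2)$ values, which is one simple way of carrying out the numerical calculation.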


6.4.2. Quadratic score regions 


An alternative form of set estimators is based on the quadratic score statistic from Definition 1.24:

$$Q_n(X, \theta) = n\,\bar S_n(X, \theta)\, i(\theta)^{-1}\, \bar S_n(X, \theta)^\top$$

which has an asymptotic $\chi^2(m)$-distribution if the family is smooth and stable and of dimension $m$ by Corollary 1.35, and hence the quadratic score $Q_n$ is also an asymptotic pivot. The corresponding confidence set is then

$$C = \{\theta \mid n\,\bar S_n(X, \theta)\, i(\theta)^{-1}\, \bar S_n(X, \theta)^\top < \chi^2_{1-\alpha}(m)\}$$

where $\bar S_n$ is the average score statistic and $\chi^2_{1-\alpha}(m)$ is the $1 - \alpha$ quantile in the $\chi^2(m)$-distribution. Clearly, this demands that the score statistic $\bar S_n$ and the Fisher information are available, and the set itself would typically need to be determined numerically.


Example 6.11. [Quadratic score intervals for the exponential distribution] For the exponential distribution with mean $\theta$, we calculated the score statistic, Fisher information, and quadratic score in Example 1.32 for the mean $\theta$ to yield

$$Q_n = n\theta^2\Big(\frac{\bar x_n}{\theta^2} - \frac{1}{\theta}\Big)^2 = n\Big(\frac{\bar x_n}{\theta} - 1\Big)^2.$$

Since the quantiles in the normal distribution satisfy $z_{1-\alpha/2} = \sqrt{\chi^2_{1-\alpha}(1)}$, where $\chi^2_{1-\alpha}(1)$ is the $1 - \alpha$ quantile in the $\chi^2(1)$-distribution, we get the confidence interval

$$C = \bar x_n\Big(\frac{1}{1 + z_{1-\alpha/2}/\sqrt{n}},\ \frac{1}{1 - z_{1-\alpha/2}/\sqrt{n}}\Big).$$

For the data in Example 6.7, we have $n = 8$ and $z_{0.975} = 1.96$, leading to the confidence interval $C_4 = (1.00, 5.52)$.
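The interval can be evaluated with a few lines of Python. This is only a sketch: the function name is hypothetical, and the data enter through $n = 8$ and $\sum_i x_i = 13.564$, the values quoted for the data example in Example 6.17. The guard reflects that the inequality $n(\bar x_n/\theta - 1)^2 < z^2$ only bounds $\theta$ from above when $z/\sqrt{n} < 1$.

```python
from math import sqrt
from statistics import NormalDist

def score_interval_exponential(xbar, n, alpha=0.05):
    """Quadratic score interval for the mean of an exponential distribution.

    Solves n * (xbar / theta - 1)**2 < z**2 for theta.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    r = z / sqrt(n)
    if r >= 1:
        raise ValueError("region is unbounded above when z / sqrt(n) >= 1")
    return xbar / (1 + r), xbar / (1 - r)

# n = 8 observations with sum 13.564 (values quoted in Example 6.17).
lower, upper = score_interval_exponential(13.564 / 8, 8)
```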


6.4.3 Wald regions 


When likelihood ratio regions are difficult to calculate, an alternative is to use the quadratic approximation to the log-likelihood function as described in Theorem 5.17 or Theorem 5.35, using the Wald statistic

$$W_n = n(\hat\beta_n - \beta)^\top i(\hat\beta_n)(\hat\beta_n - \beta),$$

which is asymptotically distributed as $\chi^2(m)$, where $m$ is the dimension of the model under investigation, so the corresponding set estimator is

$$C = \{\beta \mid n(\hat\beta_n - \beta)^\top i(\hat\beta_n)(\hat\beta_n - \beta) < \chi^2_{1-\alpha}(m)\},$$

where $\chi^2_{1-\alpha}(m)$ is the $1 - \alpha$ quantile in the $\chi^2(m)$-distribution.

The confidence region is an ellipse centred around the MLE $\hat\beta_n$. If the unknown parameter $\beta$ is one-dimensional, the ellipse becomes an interval and may alternatively be calculated as

$$C = \Big(\hat\beta_n - \frac{z_{1-\alpha/2}}{\sqrt{n\,i(\hat\beta_n)}},\ \hat\beta_n + \frac{z_{1-\alpha/2}}{\sqrt{n\,i(\hat\beta_n)}}\Big) = \hat\beta_n \pm z_{1-\alpha/2}\,\mathrm{SE},$$

where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile in the standard normal distribution and SE is the estimated standard error $(n\,i(\hat\beta_n))^{-1/2}$ of the estimate.

Example 6.12. We consider again the exponential distribution, where we previously have calculated the information for a single observation to be $i(\theta) = \theta^{-2}$ and the MLE as $\hat\theta_n = \bar x_n$. Thus the Wald region is given as

$$C = \Big(\bar x_n - \frac{z_{1-\alpha/2}\,\bar x_n}{\sqrt{n}},\ \bar x_n + \frac{z_{1-\alpha/2}\,\bar x_n}{\sqrt{n}}\Big) = \bar x_n\Big(1 - \frac{z_{1-\alpha/2}}{\sqrt{n}},\ 1 + \frac{z_{1-\alpha/2}}{\sqrt{n}}\Big),$$

which for the example data yields the interval $C_5 = (0.52, 2.87)$.
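As a small illustration, the one-dimensional Wald interval can be computed generically from an estimate and the Fisher information evaluated at the estimate. The sketch below is written under assumptions: the function name is hypothetical, and the data are summarized by $n = 8$ and $\sum_i x_i = 13.564$ as quoted in Example 6.17.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(estimate, information, n, alpha=0.05):
    """One-dimensional Wald interval: estimate +/- z * (n * i(estimate))**(-1/2)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = 1.0 / sqrt(n * information)
    return estimate - z * se, estimate + z * se

xbar, n = 13.564 / 8, 8
# For the exponential distribution with mean theta, i(theta) = theta**(-2).
lower, upper = wald_interval(xbar, xbar ** -2, n)
```

The interval is symmetric around the MLE by construction, which is one reason it can behave poorly for skewed likelihoods.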


Example 6.13. [Continuation of Example 5.33] Consider again the model with constant coefficient of variation, i.e. we assume $X_i \sim N(\beta, \beta^2)$, where $\beta > 0$ is unknown, and assume that we have $n = 10$ observations with $t_n = (0.5, 3)^\top$ leading to $\hat\beta_n = 1.5$, as calculated in Example 5.33.

To obtain a likelihood-based interval, we recall the log-likelihood function was derived in Example 5.28 as

$$\ell_n(\beta) = -\frac{SS_n}{2\beta^2} + \frac{S_n}{\beta} - n\log\beta$$

and since a solution to the score equation satisfied $SS_n = S_n\hat\beta_n + n\hat\beta_n^2$, we have for the maximized log-likelihood function

$$\ell_n(\hat\beta_n) = -\frac{SS_n}{2\hat\beta_n^2} + \frac{S_n}{\hat\beta_n} - n\log\hat\beta_n = -\frac{n}{2} + \frac{S_n}{2\hat\beta_n} - n\log\hat\beta_n.$$

We thus get for the log-likelihood ratio statistic

$$\begin{aligned}
\Lambda_n(\beta) &= 2\big(\ell_n(\hat\beta_n) - \ell_n(\beta)\big) \\
&= -n + \frac{S_n}{\hat\beta_n} - 2n\log\hat\beta_n + \frac{SS_n}{\beta^2} - \frac{2S_n}{\beta} + 2n\log\beta \\
&= \frac{30}{\beta^2} - \frac{10}{\beta} + 20\log\beta - 14.776. \qquad (6.5)
\end{aligned}$$

An asymptotic likelihood-based interval is now obtained by finding the roots of the equation $\Lambda_n(\beta) = 3.84$, since 3.84 is the 0.95 quantile in the $\chi^2$-distribution with one degree of freedom. The equation must be solved numerically and yields the 95% confidence interval $C_\Lambda = (1.05, 2.41)$.
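One simple way to solve the equation numerically is bisection on each side of the MLE, where the log-likelihood ratio is monotone. The sketch below is an illustration, not necessarily the method used for the text: the helper names are hypothetical, and the sufficient statistics $S_n = 5$ and $SS_n = 30$ are the values implied by the numerical form of equation (6.5).

```python
import math
from statistics import NormalDist

N, S, SS = 10, 5.0, 30.0   # n, sum of x_i, sum of x_i**2 (values implied by (6.5))
BETA_HAT = 1.5             # MLE; satisfies SS = S * beta + N * beta**2

def loglik(beta):
    """Log-likelihood of the N(beta, beta^2) model, up to an additive constant."""
    return -SS / (2 * beta ** 2) + S / beta - N * math.log(beta)

def lr(beta):
    """Log-likelihood ratio statistic 2 * (l(beta_hat) - l(beta))."""
    return 2 * (loglik(BETA_HAT) - loglik(beta))

def bisect(f, a, b, tol=1e-10):
    """Root of f in [a, b] by bisection; f(a) and f(b) must have opposite signs."""
    fa = f(a)
    if fa * f(b) > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fa * fm <= 0:
            b = m
        else:
            a, fa = m, fm
    return (a + b) / 2

cutoff = NormalDist().inv_cdf(0.975) ** 2   # 0.95 quantile of chi-square(1), about 3.84

def excess(beta):
    return lr(beta) - cutoff

lower = bisect(excess, 0.5, BETA_HAT)
upper = bisect(excess, BETA_HAT, 5.0)
```

The bracketing values 0.5 and 5.0 are arbitrary choices that happen to lie outside the interval for these data; other data would need other brackets.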

In Example 5.31 we established that the asymptotic distribution of $\hat\beta_n$ was $N(\beta, \beta^2/(3n))$, and hence, the Wald statistic using the theoretical variance is

$$W_n(\beta) = \frac{3n}{\beta^2}(\hat\beta_n - \beta)^2 = 30\Big(\frac{1.5}{\beta} - 1\Big)^2. \qquad (6.6)$$

Solving the inequality $W_n(\beta) < 3.84$ yields the interval

$$C_W = \Big\{\beta : \frac{1.5}{1 + 1.96/\sqrt{30}} < \beta < \frac{1.5}{1 - 1.96/\sqrt{30}}\Big\} = (1.10, 2.34).$$

Finally, a Wald-based confidence interval using the estimated asymptotic variance $\hat\beta_n^2/(3n)$ would be

$$C_{\hat W} = 1.5 \pm 1.96\sqrt{1.5^2/30} = 1.5\big(1 \pm 1.96\sqrt{1/30}\big) = (0.96, 2.04).$$

Again this is somewhat shorter than the other intervals, as we often see using this type of Wald statistic.


We conclude with a multivariate example. 


Example 6.14. [Set estimates for gamma distribution] Since an exponential distribution is also a gamma distribution, it makes sense to analyse the data from Example 6.7 using the gamma distribution with unknown shape $\alpha$ and scale $\beta$ where $(\alpha, \beta) \in \mathbb{R}_+^2$. The maximum-likelihood estimates of the unknown parameters are

$$\hat\alpha = 1.330, \qquad \hat\beta = 1.275$$

and the Fisher information was calculated in Example 5.16 to be

$$i(\alpha, \beta) = \begin{pmatrix} \Psi_1(\alpha) & 1/\beta \\ 1/\beta & \alpha/\beta^2 \end{pmatrix},$$

where $\Psi_1$ denotes the trigamma function. We shall calculate the Wald statistic $W$, so the estimated information matrix becomes

$$i(\hat\alpha, \hat\beta) = \begin{pmatrix} \Psi_1(\hat\alpha) & 1/\hat\beta \\ 1/\hat\beta & \hat\alpha/\hat\beta^2 \end{pmatrix} = \begin{pmatrix} 1.099 & 0.784 \\ 0.784 & 0.818 \end{pmatrix}.$$

The 95% quantile in the $\chi^2(2)$-distribution is $\chi^2_{0.95}(2) = 5.99$ so the corresponding Wald region based on these $n = 8$ observations is the ellipse

$$8\big(1.099(\alpha - 1.33)^2 + 1.57(\alpha - 1.33)(\beta - 1.275) + 0.818(\beta - 1.275)^2\big) < 5.99,$$

which is displayed in Figure 6.3.

The confidence ellipse may be compared to the asymptotic likelihood region determined as

$$\Lambda(x_1, \dots, x_8; \alpha, \beta) < 5.99.$$

These likelihood regions can only be determined numerically and are displayed in Fig. 6.4. We note that the likelihood regions are not so well approximated with ellipses and have more the shape of a banana; the likelihood-based regions also appear to be larger than the Wald regions, suggesting that the Wald regions may not give the right coverage.
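Membership of a point $(\alpha, \beta)$ in the Wald ellipse amounts to evaluating a quadratic form. The following short Python sketch uses the rounded estimates and information matrix quoted in the example; the function names are hypothetical.

```python
INFO_HAT = [[1.099, 0.784], [0.784, 0.818]]  # estimated Fisher information (rounded)
MLE = (1.330, 1.275)                         # (alpha_hat, beta_hat)

def wald_statistic(point, estimate=MLE, info=INFO_HAT, n=8):
    """n * (estimate - point)^T info (estimate - point) for a symmetric 2x2 matrix."""
    d0 = estimate[0] - point[0]
    d1 = estimate[1] - point[1]
    quad = info[0][0] * d0 * d0 + 2 * info[0][1] * d0 * d1 + info[1][1] * d1 * d1
    return n * quad

def in_wald_region(alpha, beta, cutoff=5.99):
    """True if (alpha, beta) lies inside the 95% Wald ellipse."""
    return wald_statistic((alpha, beta)) < cutoff
```

Note that the ellipse is defined by the quadratic form alone and therefore also contains points outside the parameter space if it extends beyond the positive quadrant.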


Figure 6.3: The 95% confidence ellipse for the parameters of a gamma distribution based on the Wald statistic.

Figure 6.4: Level curves of the log-likelihood ratio; the corresponding likelihood ratio set is determined by the contour for $\Lambda = 6$. The dot represents the MLE $(\hat\alpha, \hat\beta)$.


6.4.4 Confidence regions for parameter functions


Here we describe a simple way of constructing approximate confidence regions for general smooth parameter functions. Suppose we are particularly interested in estimating a smooth parameter function $\lambda = \phi(\theta)$, where $\phi : \Theta \to \mathbb{R}^m$ is not necessarily injective, but the Jacobian $D\phi(\theta)$ has full rank $m$. It is then fairly easy to construct confidence sets for $\lambda$ based on Wald statistics.

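Indeed, if we are in a situation where $\hat\theta_n \overset{\mathrm{as}}{\sim} N(\theta, i(\theta)^{-1}/n)$ we may use the delta method (Theorem A.19) to deduce that

$$\hat\lambda_n = \phi(\hat\theta_n) \overset{\mathrm{as}}{\sim} N\Big(\lambda,\ \frac{1}{n} D\phi(\theta)\,i(\theta)^{-1} D\phi(\theta)^\top\Big)$$

and by the usual arguments we conclude that the Wald statistic

$$W_n = n(\hat\lambda_n - \lambda)^\top \big\{D\phi(\hat\theta_n)\,i(\hat\theta_n)^{-1} D\phi(\hat\theta_n)^\top\big\}^{-1} (\hat\lambda_n - \lambda) \xrightarrow{D} \chi^2(m),$$

which now gives a sound basis for making confidence sets. The most common application of this is to construct confidence intervals for single coordinates, corresponding to the case with $m = 1$. In that case, the regions become intervals and will all have the form

$$C = \hat\lambda_n \pm z_{1-\alpha/2}\,\mathrm{SE}$$

where SE is the estimated standard error

$$\mathrm{SE} = \sqrt{D\phi(\hat\theta_n)\,i(\hat\theta_n)^{-1} D\phi(\hat\theta_n)^\top / n}.$$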


Example 6.15. We illustrate this procedure in the gamma example. Suppose we specifically wish to construct a confidence interval just for the shape parameter $\alpha$. In Example 6.14 we calculated the information matrix to be

$$i(\hat\alpha, \hat\beta) = \begin{pmatrix} 1.099 & 0.784 \\ 0.784 & 0.818 \end{pmatrix}$$

with scaled inverse

$$i(\hat\alpha, \hat\beta)^{-1}/8 = \begin{pmatrix} 0.36 & -0.34 \\ -0.34 & 0.48 \end{pmatrix}$$

so the 95% Wald interval for $\alpha$ becomes

$$C = \hat\alpha_n \pm 1.96\sqrt{0.36} = 1.33 \pm 1.18 = (0.15, 2.51).$$

If we instead consider a confidence interval for the mean of the distribution, i.e. $\phi(\alpha, \beta) = \alpha\beta = E_{\alpha,\beta}(X)$, we get

$$D\phi(\alpha, \beta) = (\beta, \alpha)$$

implying that the asymptotic variance is

$$\frac{1}{8(\alpha\Psi_1(\alpha) - 1)}\,(\beta, \alpha) \begin{pmatrix} \alpha & -\beta \\ -\beta & \beta^2\Psi_1(\alpha) \end{pmatrix} (\beta, \alpha)^\top = \frac{\alpha\beta^2}{8}$$

and since $\hat\alpha\hat\beta^2/8 = 0.27$, this leads to the interval


$$C_6 = 1.6955 \pm 1.96\sqrt{0.27} = (0.68, 2.71)$$


which is comparable to the intervals we earlier found for the mean, using the exponential distribution.

This could also have been derived directly by realizing that the MLE in an exponential family is a moment estimator, and hence, the mean $E_{\alpha,\beta}(X) = \alpha\beta$ is estimated by

$$\hat\alpha_n\hat\beta_n = \bar X_n \overset{\mathrm{as}}{\sim} N\Big(\alpha\beta,\ \frac{1}{n} V_{\alpha,\beta}(X)\Big) = N\Big(\alpha\beta,\ \frac{\alpha\beta^2}{n}\Big)$$

leading to the same result.
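The delta-method standard error for a scalar function of two parameters is easy to evaluate directly. The sketch below is a minimal illustration with hypothetical function names; it uses the rounded estimates and information matrix quoted for the gamma example, so results agree with the exact expressions only up to rounding.

```python
from math import sqrt

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def delta_method_se(gradient, info, n):
    """Standard error sqrt(D i^{-1} D^T / n) for a scalar function of two parameters."""
    inv = inverse_2x2(info)
    quad = sum(gradient[i] * inv[i][j] * gradient[j]
               for i in range(2) for j in range(2))
    return sqrt(quad / n)

INFO_HAT = [[1.099, 0.784], [0.784, 0.818]]
ALPHA_HAT, BETA_HAT, N = 1.330, 1.275, 8

se_shape = delta_method_se((1.0, 0.0), INFO_HAT, N)            # phi(alpha, beta) = alpha
se_mean = delta_method_se((BETA_HAT, ALPHA_HAT), INFO_HAT, N)  # phi(alpha, beta) = alpha * beta
```

Here `se_shape ** 2` and `se_mean ** 2` correspond to the asymptotic variances of the estimated shape and the estimated mean, respectively; the latter can be compared with the closed form $\hat\alpha\hat\beta^2/n$.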


6.5 Properties of set estimators 


We have seen a number of methods for constructing set estimators but have not paid 
much attention to discussing what are good properties. We first consider how the set 
estimators behave under reparametrization. 


6.5.1 Reparametrization 


We consider a diffeomorphism $\phi : \Theta \to \Lambda$ and recall from Theorem 1.30 that the log-likelihood function and information function in the two parametrizations satisfy

$$\ell_x(\theta) = \tilde\ell_x(\lambda), \qquad i(\theta) = D\phi(\theta)^\top\,\tilde\imath(\lambda)\,D\phi(\theta)$$

for $\lambda = \phi(\theta)$. Since the log-likelihood function is equivariant, the likelihood ratio based regions are as well, since for any $a$

$$\theta \in C^a = \{\theta \mid \Lambda(x, \theta) < a\} \iff \lambda \in \tilde C^a = \{\lambda \mid \tilde\Lambda(x, \lambda) < a\}$$

and thus we have

$$\tilde C^a = \phi(C^a).$$

The same holds for sets based on pivots, since if $R(X, \theta)$ is a pivot, we have $R(X, \theta) = \tilde R(X, \lambda)$ for $\lambda = \phi(\theta)$, so such sets are also equivariant.

As the quadratic score statistic is equivariant, so are the associated confidence regions. However, this is not true for the sets based on the Wald statistics since, for example,

$$W(\theta) = n(\hat\theta_n - \theta)^\top i(\hat\theta_n)(\hat\theta_n - \theta)$$

whereas in the $\lambda$ parametrization we have for $\lambda = \phi(\theta)$

$$\tilde W(\lambda) = n(\hat\lambda_n - \lambda)^\top D\phi(\hat\theta_n)^{-\top}\,i(\hat\theta_n)\,D\phi(\hat\theta_n)^{-1}(\hat\lambda_n - \lambda)$$

and similarly for other types of Wald statistics.

As a consequence, we shall be careful with the choice of parametrization when calculating Wald intervals, and often we shall first choose a suitable parametrization and then transform the relevant interval back to the scale we want. We illustrate this in an example.




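Example 6.16. [Exponential rates] Let us illustrate the above considerations for the simple model with an exponential distribution, assuming that we are interested in confidence intervals for the rate $\lambda = 1/\theta$. We previously found that

$$C_1 = C_1(x_1, \dots, x_n) = \Big(\frac{\bar x_n}{g_{0.975}},\ \frac{\bar x_n}{g_{0.025}}\Big)$$

is a 95% confidence interval for $\theta$ and the similar confidence interval for the rate is now

$$\tilde C_1 = \tilde C_1(x_1, \dots, x_n) = \Big(\frac{g_{0.025}}{\bar x_n},\ \frac{g_{0.975}}{\bar x_n}\Big).$$

The same is true for the likelihood-based intervals.

If we, however, consider the Wald intervals, we have a different story. In Example 6.12 we found the following confidence interval for the mean $\theta$

$$C = \bar x_n\Big(1 - \frac{z_{1-\alpha/2}}{\sqrt{n}},\ 1 + \frac{z_{1-\alpha/2}}{\sqrt{n}}\Big).$$

We have previously calculated the Fisher information in the $\lambda$ parametrization to be $\tilde\imath(\lambda) = 1/\lambda^2$ and thus the Wald set for the rate becomes

$$\tilde C = \Big\{\lambda \,\Big|\, \frac{n(\hat\lambda_n - \lambda)^2}{\hat\lambda_n^2} < \chi^2_{1-\alpha}(1)\Big\}$$

which again leads to a quadratic equation but now the interval becomes

$$\tilde C = \frac{1}{\bar x_n}\Big(1 - \frac{z_{1-\alpha/2}}{\sqrt{n}},\ 1 + \frac{z_{1-\alpha/2}}{\sqrt{n}}\Big),$$

which is not the image of $C$ under the map $\theta \mapsto 1/\theta$.

The absence of equivariance may in certain circumstances be an advantage. Suppose we consider the parameter $\tau = \log\theta$. Then we get that the Fisher information for $\tau$ satisfies

$$i(\theta) = \tilde\imath(\tau)\,\frac{1}{\theta^2}$$

and since we have $i(\theta) = \theta^{-2}$, the Fisher information for $\tau$ is constant: $\tilde\imath(\tau) = 1$. This implies that all variants of Wald intervals for $\tau$ are identical and given as

$$C^\tau = \Big(\log\bar x_n - \frac{z_{1-\alpha/2}}{\sqrt{n}},\ \log\bar x_n + \frac{z_{1-\alpha/2}}{\sqrt{n}}\Big).$$

A transformation of this type is known as a variance stabilizing transformation. We may now choose to transform this interval to the original scale and use

$$C^\theta = \Big(\bar x_n\,e^{-z_{1-\alpha/2}/\sqrt{n}},\ \bar x_n\,e^{z_{1-\alpha/2}/\sqrt{n}}\Big)$$

as a confidence interval for $\theta$ yielding $C_7 = (0.85, 3.39)$ for our data example.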
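The two-step recipe (Wald interval on the stabilized scale, then transformation back) can be sketched in Python as follows. The function name is hypothetical, and the data are again summarized by $n = 8$ and $\sum_i x_i = 13.564$ as quoted in Example 6.17.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def stabilized_wald_interval(xbar, n, alpha=0.05):
    """Wald interval for tau = log(theta), where i(tau) = 1, mapped back to theta."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z / sqrt(n)              # standard error on the tau scale is n**(-1/2)
    tau_lower = log(xbar) - half_width
    tau_upper = log(xbar) + half_width
    return exp(tau_lower), exp(tau_upper)

lower, upper = stabilized_wald_interval(13.564 / 8, 8)
```

Unlike the Wald interval on the original scale, the resulting interval is not symmetric around $\bar x_n$ and can never contain non-positive values.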
Table 6.1: Empirical coverage and average length (in parentheses) of 95% confidence intervals for $\theta$ based on 5000 repeated samples of size $n$ from an exponential distribution with mean $\theta = 5$.

             Pivot     Λ         Q         W         W*
n = 10       0.946     0.947     0.949     0.901     0.939
             (7.49)    (7.20)    (10.05)   (6.19)    (6.59)
n = 50       0.955     0.956     0.956     0.941     0.952
             (2.87)    (2.87)    (2.99)    (2.76)    (2.80)
n = 500      0.950     0.951     0.951     0.948     0.949
             (0.88)    (0.89)    (0.88)    (0.88)    (0.88)


6.5.2 Coverage and length 


We now have several possibilities for calculating a set estimate, and for the simple exponential model, this leads to seven different intervals:

a) Based on the exact pivot $R(x, \theta) = \bar X_n/\theta$: $C_1(x) = (0.94, 3.93)$;

b) Based on $\Lambda_n(x, \theta)$ with exact cutoff: $C_2(x) = (0.91, 3.74)$;

c) Based on the likelihood ratio with asymptotic cutoff: $C_3(x) = (0.91, 3.68)$;

d) Based on the quadratic score: $C_4(x) = (1.00, 5.52)$;

e) Based on the Wald statistic: $C_5(x) = (0.52, 2.87)$;

f) Based on the Wald statistic in the gamma model: $C_6(x) = (0.68, 2.71)$;

g) Stabilized Wald $W_n^* = n(\log\hat\theta_n - \log\theta)^2$: $C_7(x) = (0.85, 3.39)$.

They all have either exact or approximate coverage equal to 95%, but the approximations may be of different quality. They have quite different lengths, and if we can keep the coverage, we would rather have short intervals than long intervals, to give as precise an estimate as possible.

Table 6.1 displays the result of a simulation experiment based on 5000 repetitions of samples of size $n$ for $n = 10, 50, 500$ from an exponential distribution with mean $\theta = 5$. In each case five different intervals, corresponding to a), b), d), e), and g), have been calculated. We have also calculated the average length of the intervals and the empirical coverage, i.e. the proportion of intervals containing the true value $\theta = 5$.

We first note that the number of intervals containing the true value follows a binomial distribution with success parameter being the true coverage and $N = 5000$. If we let $1 - \alpha = 0.95$, approximate confidence intervals for the true coverage are obtained by adding

$$\pm 1.96\sqrt{0.95 \times 0.05/5000} = \pm 0.006$$

to the numbers in the table.

The coverage is mostly as it should be, save for the case $n = 10$, where $W$ and $W^*$ have coverages that are too small, the stabilized $W^*$ performing better than $W$. This phenomenon is still visible for $n = 50$.

The shortest intervals are those given by $W$ and $W^*$, but the price is paid in terms of failing coverage. The quadratic score intervals appear to be excessively long for $n = 10$ and still rather long for $n = 50$. Among the intervals that give correct coverage, it appears that the two exact intervals based on the exact pivot and $\Lambda$ are shortest and therefore preferable.

For $n = 500$ it really does not matter and all types of set estimators perform equally well.
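A simulation experiment of this kind can be sketched in a few lines of Python. The code below is an illustration under assumptions and not the program used to produce Table 6.1: it covers only the three intervals with closed-form expressions (quadratic score, Wald, and stabilized Wald), the function names and the seed are arbitrary, and the numbers it produces will differ slightly from the table because of simulation error.

```python
import random
from math import exp, sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)

def intervals(xbar, n):
    """Quadratic score (Q), Wald (W) and stabilized Wald (W*) intervals for the mean."""
    r = Z / sqrt(n)
    return {
        "Q": (xbar / (1 + r), xbar / (1 - r) if r < 1 else float("inf")),
        "W": (xbar * (1 - r), xbar * (1 + r)),
        "W*": (xbar * exp(-r), xbar * exp(r)),
    }

def simulate(n, theta=5.0, reps=5000, seed=1):
    """Empirical coverage and average length over repeated exponential samples."""
    rng = random.Random(seed)
    hits = {name: 0 for name in ("Q", "W", "W*")}
    total_length = {name: 0.0 for name in hits}
    for _ in range(reps):
        # random.expovariate takes the rate, i.e. 1 / mean.
        xbar = sum(rng.expovariate(1 / theta) for _ in range(n)) / n
        for name, (lo, hi) in intervals(xbar, n).items():
            hits[name] += lo < theta < hi
            total_length[name] += hi - lo
    return {name: (hits[name] / reps, total_length[name] / reps) for name in hits}

result = simulate(10)   # maps interval name to (coverage, average length)
```

Calling `simulate(50)` and `simulate(500)` gives the corresponding rows for the larger sample sizes.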


6.6 Credibility regions 


Confidence sets are often misinterpreted, and it is a common mistake to say that a 95% confidence interval, $C$ say, contains the true parameter with 95% probability or, rephrasing, that the probability that $\theta \in C$ is 95%. This is incorrect and indeed the statement makes little sense as $\theta$ is not random, but rather the set $C = C(X)$ is. In fact, the coverage of 95% is a property of the procedure used to construct the interval, not a property of the interval itself.

However, there is an alternative statistical paradigm, the Bayesian paradigm, where statements of this kind do make sense, and we shall briefly sketch the arguments within this paradigm without going too much into detail.

Consider a (Fisherian) statistical model $\mathcal{P} = \{P_\theta, \theta \in \Theta\}$ with an associated family $\mathcal{F} = \{f_\theta \mid \theta \in \Theta\}$ of densities with respect to a base measure $\mu$ on the representation space $(\mathcal{X}, \mathbb{E})$. A Bayesian statistical model adjoins what is known as a prior distribution $\pi$ on the parameter space $(\Theta, \mathbb{T})$, which then must have an associated $\sigma$-algebra $\mathbb{T}$. The prior distribution reflects what is believed about $\theta$ prior to observing $X = x$ and in combination with the Fisherian model, it specifies a joint distribution $P$ over $(\Theta \times \mathcal{X}, \mathbb{T} \times \mathbb{E})$ through the relation

$$P(A \times B) = \int_A \int_B f_\theta(x)\,\mu(dx)\,\pi(d\theta), \qquad A \in \mathbb{T},\ B \in \mathbb{E}.$$

Within the Bayesian paradigm, a Fisherian model is simply incompletely specified, as the prior knowledge about $\theta$ fails to be represented in the model, as it only specifies the distribution of $X$ for fixed values of $\theta$.

Having observed the outcome $X = x$, the information about $\theta$ is updated using Bayes' formula to yield the posterior distribution $\pi^*$ given as

$$\pi^*(\theta \in A) = \frac{\int_A f_\theta(x)\,d\pi(\theta)}{\int_\Theta f_\eta(x)\,d\pi(\eta)} = k(x)^{-1}\int_A L_x(\theta)\,d\pi(\theta),$$

where $k(x)$ is a normalizing constant ensuring that the integral is equal to one:

$$k(x) = \int_\Theta f_\eta(x)\,d\pi(\eta).$$

In other words, the normalized likelihood function $L_x/k(x)$ is the density of the posterior distribution $\pi^*$ with respect to the prior distribution $\pi$, sometimes written as

$$\text{posterior} \propto \text{likelihood} \times \text{prior}.$$

Note that any arbitrary multiplicative constant in the likelihood function cancels in the normalization process, so the posterior distribution $\pi^*$ only depends on the shape of $L_x$ and not its absolute size. This makes a lot of sense, since likelihood functions are only well-defined up to arbitrary multiplicative constants; see Theorem 1.16.

A set $A(x) \subseteq \Theta$ with $A(x) \in \mathbb{T}$ is a $1 - \alpha$ credibility region for $\theta$ if it holds that

$$\pi^*\{\theta \in A(x)\} = k(x)^{-1}\int_{A(x)} L_x(\theta)\,d\pi(\theta) = 1 - \alpha.$$

Note here that $\theta$ is random (unknown) while $A(x)$ is fixed and known; so in this paradigm, it makes sense to say that the probability that $\theta$ is in the region is equal to $1 - \alpha$, but it demands an alternative interpretation of the notion of probability, interpreting probabilities via betting, so $P(A) = p$ means that the odds

$$\text{odds} = \frac{p}{1 - p}$$

would be fair odds in a bet on the event $A$ occurring. Such an interpretation of probability is known as subjective probability.


Example 6.17. [Exponential distribution] Let us again consider the model for the exponential distribution. If we say that the rate $\lambda = 1/\theta$ has an exponential prior distribution with mean 1, i.e.

$$\pi(\lambda) = e^{-\lambda}, \qquad \lambda > 0,$$

we find the posterior distribution to be

$$\pi^*(\lambda) \propto \lambda^n e^{-\lambda\sum_i x_i}\,e^{-\lambda} = \lambda^n e^{-\lambda(1 + \sum_i x_i)}, \qquad \lambda > 0,$$

which we recognize as a gamma distribution with scale parameter $(1 + \sum_i x_i)^{-1}$ and shape parameter $n + 1$.

In our standard data example, we have $n = 8$ and $\sum_i x_i = 13.564$, meaning that the posterior distribution of $\lambda$ is $\Gamma(9, 1/14.564)$. Thus we have the following 95% credibility interval for $\lambda$:

$$g_{0.025}(9, 14.564) < \lambda < g_{0.975}(9, 14.564)$$

where $g_\gamma(a, b)$ is the $\gamma$ quantile in the gamma distribution with shape $a$ and rate $b$. This gives the following interval for the mean $\theta = 1/\lambda$

$$\frac{1}{g_{0.975}(9, 14.564)} < \theta < \frac{1}{g_{0.025}(9, 14.564)}$$


which yields the credibility interval $A(x) = (0.924, 3.539)$.
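The gamma quantiles needed here are available in most statistical software. As a self-contained illustration, the Python sketch below uses only the standard library and exploits that the posterior shape $n + 1 = 9$ is an integer, so that the distribution function has a closed form (the Erlang case) and the quantiles can be found by bisection. The function names are hypothetical.

```python
import math

def erlang_cdf(x, shape, rate):
    """CDF of a gamma distribution with integer shape:
    1 - exp(-rate * x) * sum_{j < shape} (rate * x)**j / j!"""
    t = rate * x
    term, partial_sum = 1.0, 1.0
    for j in range(1, shape):
        term *= t / j
        partial_sum += term
    return 1.0 - math.exp(-t) * partial_sum

def erlang_quantile(p, shape, rate, tol=1e-12):
    """Quantile by bisection, using that the CDF is increasing in x."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, shape, rate) < p:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if erlang_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n, sum_x = 8, 13.564
shape, rate = n + 1, 1 + sum_x               # posterior of the rate: shape 9, rate 14.564
lam_lower = erlang_quantile(0.025, shape, rate)
lam_upper = erlang_quantile(0.975, shape, rate)
theta_interval = (1 / lam_upper, 1 / lam_lower)  # credibility interval for theta = 1 / lambda
```

The interval is equal-tailed by construction; other credibility regions with the same posterior probability, such as highest posterior density regions, would require a different computation.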


The credibility interval $A(x)$ found above is quite similar to the confidence intervals we found using other methods, even though the interpretation is different. This is not a coincidence. Indeed one can show (we refrain from doing this here) that it holds asymptotically in the posterior distribution for large $n$ that

$$\theta \overset{\mathrm{as}}{\sim} N_k\big(\hat\theta_n,\ i(\hat\theta_n)^{-1}/n\big).$$

Here in the Bayesian paradigm $\theta$ is random (since it is unknown), whereas $\hat\theta_n$ is fixed (since it has been observed). The consequence is that it also holds in the posterior distribution that the Wald statistic $W_n$ satisfies

$$W_n(x, \theta) = n(\theta - \hat\theta_n)^\top i(\hat\theta_n)(\theta - \hat\theta_n) \overset{\mathrm{as}}{\sim} \chi^2(k).$$

This means that the 95% confidence interval

$$C(x) = \{\theta \mid W_n(x, \theta) < \chi^2_{0.95}(k)\}$$

is also (approximately) a credibility interval. So in such cases, there is some justification in (mis)interpreting the confidence interval as a credibility interval.


6.7 Exercises


Exercise 6.1. The Weibull distribution with shape parameter $\alpha > 0$ has density function

$$f_\alpha(x) = \alpha x^{\alpha - 1} e^{-x^\alpha}$$

with respect to standard Lebesgue measure on $\mathbb{R}_+$. The Weibull distribution is, for example, much used in Reliability Theory as a distribution of the time to failure of a component in a complex system. Assume that $X$ follows such a Weibull distribution. Construct a 95% two-sided confidence interval $C(X)$ for $\alpha$ based on the universal pivot.

Exercise 6.2. The von Mises distribution is a distribution on the unit circle or, equivalently, the interval $(-\pi, \pi]$ with density

$$f_{\kappa,\theta}(x) = \frac{e^{\kappa\cos(x - \theta)}}{2\pi I_0(\kappa)}$$

with respect to standard Lebesgue measure on this interval. Here $\kappa > 0$ is the precision of the distribution and $\theta \in \Theta = (-\pi, \pi]$ is the principal direction; the normalizing constant $I_0(\kappa)$ is known as the modified Bessel function of order 0. This distribution is important for analyzing directional data.

Now, consider $\kappa$ to be fixed and known and $\theta$ unknown and assume that a random variable $X$ with this distribution is observed. We are interested in constructing a confidence set for the principal direction $\theta$.

a) Show that the likelihood ratio statistic is

$$\Lambda(X, \theta) = 2\kappa(1 - \cos(X - \theta)).$$

b) Show that $\Lambda(X, \theta)$ is a pivot.

c) Determine 95% confidence sets $C^\kappa(X)$ for $\theta$ based on observation of $X$ for a range of values of $\kappa$, for example by determining the relevant quantiles by Monte Carlo simulation.


Exercise 6.3. Consider again the inverse normal distribution as in Exercise 3.4 with density

$$f_{\mu,\lambda}(x) = \sqrt{\frac{\lambda}{2\pi x^3}}\,\exp\Big\{\frac{-\lambda(x - \mu)^2}{2\mu^2 x}\Big\}$$

with respect to standard Lebesgue measure on $\mathbb{R}_+$ and consider the subfamily $\mathcal{P}_0$ determined by the restriction $\mu = 1$, parametrized with $\lambda$. Let $X_1, \dots, X_n$ be a sample from this family.

a) Show that the MLE $\hat\lambda_n$ for $\lambda$ is determined as

$$\hat\lambda_n = \frac{1}{\bar X_n + \bar Y_n - 2}, \quad \text{where } \bar Y_n = \frac{1}{n}\sum_{i=1}^n \frac{1}{X_i}.$$

b) Determine an asymptotic confidence interval for $\lambda$ based on the quadratic score statistic for the sample.

Exercise 6.4. Let $X$ and $Y$ be independent and exponentially distributed random variables with $E(X) = \beta$ and $E(Y) = 1/\beta$ where $\beta > 0$ as in Exercise 5.10 and let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a sample from this distribution.

a) Determine an asymptotic confidence interval for $\beta$ based on the quadratic score statistic using the sample $(X_1, Y_1), \dots, (X_n, Y_n)$.

b) Determine an asymptotic confidence interval for $\beta$ from the moment estimator $\tilde\beta_n$ for $\beta$ based on $(X_1, Y_1), \dots, (X_n, Y_n)$ and the statistic $t(x, y) = x - y$.

c) Make a simulation study to compare the coverage and length for the two types of interval.


Exercise 6.5. Let $X$ follow a log-normal distribution as in Exercise 1.3 with parameters $(\xi, \sigma^2) \in \mathbb{R} \times \mathbb{R}_+$ which are both considered unknown. In other words, $Y = \log X$ with $Y \sim N(\xi, \sigma^2)$. Consider a sample $X_1, \dots, X_n$ from this distribution.

a) Determine an asymptotic Wald confidence interval for the median $\lambda$ of the distribution

$$\lambda = \phi_1(\xi, \sigma^2) = \mathrm{med}_{\xi,\sigma^2}(X) = e^{\xi}.$$

b) Determine an asymptotic Wald confidence interval for the mean $\mu$ of the distribution

$$\mu = \phi_2(\xi, \sigma^2) = E_{\xi,\sigma^2}(X) = e^{\xi + \sigma^2/2}.$$

c) Determine an asymptotic Wald confidence interval for the coefficient of variation $\delta$ of the distribution

$$\delta = \phi_3(\xi, \sigma^2) = C_{\xi,\sigma^2}(X) = \sqrt{e^{\sigma^2} - 1}.$$

d) Investigate the coverage of these intervals for various values of $n$ and $\sigma^2$ by simulation.

Exercise 6.6. Let $X \sim \mathrm{binom}(n, \mu)$, $\mu \in (0, 1)$. Determine the following approximate confidence intervals for the unknown probability of success $\mu$:

a) A confidence interval based on the quadratic score statistic $Q(X, \mu)$;

b) A confidence interval based on a Wald statistic for the log-odds ratio

$$\theta = \log\frac{\mu}{1 - \mu},$$

transformed back to $\mu$;

c) A confidence interval based on a Wald statistic for the parameter $\psi = \sin^{-1}(\sqrt{\mu})$, transformed back to $\mu$.

d) Make a simulation study to compare these intervals with respect to coverage and length.

Note that the length $n$ in the binomial distribution of $X$ can be considered just as a label, but also as representing that $X = Y_1 + \dots + Y_n$ where $Y_1, \dots, Y_n$ are independent and identically Bernoulli distributed. In this way, the use of asymptotic results is justified, simply because the Fisher information tends to infinity. Compare this to the discussion in Example 5.41.


Chapter 7


Significance Testing 


7.1 The problem 


We consider a statistical model with associated family $\mathcal{P} = \{P_\theta \mid \theta \in \Theta\}$ on the representation space $(\mathcal{X}, \mathbb{E})$ and an observation $X = x$. We are interested in determining whether this observation supports specific statements about the unknown parameter, for example that the parameter has a specific value $\theta_0 \in \Theta$, or the parameter is inside a specific subset $\Theta_0 \subset \Theta$, known as a hypothesis.

Although hypothesis testing is very much at the core of many applications of statistics, the subject is quite controversial, partly because hypothesis testing is used in a huge variety of contexts and some considerations may not appear equally relevant in all contexts.


7.2 Hypotheses and test statistics 
7.2.1 Formal concepts 


Formally we consider a null hypothesis

$$H_0 : \theta \in \Theta_0,$$

where $\Theta_0 \subset \Theta$. Generally we do not distinguish between the hypothesis $H_0$ and the representing subset $\Theta_0$. We use the term alternative hypothesis for the complement

$$H_A : \theta \in \Theta \setminus \Theta_0.$$

We may equivalently represent a hypothesis as a subfamily $\mathcal{P}_0 = \{P_\theta \mid \theta \in \Theta_0\} \subset \mathcal{P}$ of the family associated with the basic statistical model.

When testing a statistical hypothesis, the first problem is to choose a suitable test statistic, which is a map $d : \mathcal{X} \to \mathbb{R}$. Without loss of generality, we assume here and later that the test statistic is chosen so that large values of $d$ indicate that the hypothesis is unlikely or, in other words, $d$ measures a deviation from the hypothesis. A suitable transformation of any chosen statistic will always have this property. We express this by saying that large values of $d$ are critical.

We have already seen such statistics in the previous chapter, as they have been used to construct set estimators. A canonical choice of test statistic is the maximized likelihood ratio test statistic or, in short, the likelihood ratio

$$\Lambda = \Lambda(x) = -2\log\frac{\sup_{\theta \in \Theta_0} L_x(\theta)}{\sup_{\theta \in \Theta} L_x(\theta)}$$

obtained by comparing the highest possible value of the likelihood function assuming the null hypothesis with the highest possible value without this restriction on the parameter.

Indeed, if we let $\hat{\hat\theta}$ and $\hat\theta$ denote the maximum-likelihood estimates (MLE) assuming the smaller model and the model without this assumption, respectively,

$$\hat{\hat\theta} = \arg\max_{\theta \in \Theta_0} L_x(\theta), \qquad \hat\theta = \arg\max_{\theta \in \Theta} L_x(\theta),$$

we have

$$\Lambda(x) = 2\big(\ell_x(\hat\theta) - \ell_x(\hat{\hat\theta})\big). \qquad (7.1)$$

Alternatives include statistics of the Wald type, and others, as we shall see in the following.


7.2.2 Classifying hypotheses by purpose 


The hypothesis might have its own substantive interest, such as whether a treatment 
is ineffective or not, or it might just represent a desirable simplification of the model. 
Below we shall highlight some of the relevant possibilities. 

Simplifying hypotheses: We wish to investigate whether a simplification of the 
model represented by the smaller family $\mathcal{P}_0$ could be suitable; here the issue is to
avoid models that are unnecessarily complex, for example because our estimates of 
the unknown parameters then become more precise and reliable. Or we might wish 
to use the scientific principle sometimes known as Occam’s razor that would always 
prefer a simple model to a complex one, if the simpler model is still satisfactory. 

Confirmatory hypotheses: We may have a theoretical reason that predicts a spe- 
cific value of a parameter function, but wish to confirm the theoretical value by an 
empirical investigation. 

Reverse hypotheses: Consider an experiment that is made with the purpose of establishing that a certain treatment, medical or other, definitely has an effect, say of magnitude $\theta \in \mathbb{R}$. It is then customary to formulate the null hypothesis in a reverse manner, as $H_0 : \theta = 0$, i.e. the formal statistical hypothesis says that there is no effect. This is one of many examples of statisticians using the term 'hypothesis' in a manner different from most scientists. The hope is that the statistical test can falsify the hypothesis and thus establish beyond reasonable doubt that the treatment actually has an effect.

Model criticism: Occasionally we perform statistical hypothesis testing as an activity of the Devil's advocate; we wish to subject our statistical model to criticism and make sure the model survives that criticism, thus increasing our confidence in its validity. Such hypotheses are often tested by graphical methods, say in residual analysis, rather than formal quantitative tests. Another term that is often used for this type of test is goodness-of-fit. It is characteristic for this type of hypothesis that the alternative hypothesis $H_A$ may not be fully specified, or not specified at all, and certainly does not play an important role: we only specify the test statistic $D = d(X)$ and investigate whether its value conforms with $H_0$.

The list of possibilities given above is far from exhaustive, as the terminology 
and methodology of hypothesis testing is used in an almost infinite variety of ways. 
As a consequence, it is hard to present a single formal theory of hypothesis testing 
that seems suitable for all these different purposes, and the theory will from time to 
time seem awkward in specific situations. 


7.2.3 Mathematical classification of hypotheses 


Another dimension for the classification of statistical hypotheses is through the mathematical properties of the subset $\Theta_0 \subset \Theta$.

Simple or composite: We say that the hypothesis is simple if $\Theta_0 = \{\theta_0\}$ consists of a single point. If the hypothesis is not simple, we say that it is composite.

Linear hypotheses are hypotheses given as $\Theta_0 = L \cap \Theta$, where $L \subset V$ is a linear subspace of a vector space $V$ with $\Theta \subset V$. The subspace $L$ can either be given as an image or inverse image of a linear map:

$$\Theta_0 = \{\theta \in \Theta \mid \theta = A\beta,\ \beta \in \mathbb{R}^m\}, \qquad \Theta_0 = \{\theta \in \Theta \mid H\theta = 0\}.$$

Affine hypotheses are hypotheses given as $\Theta_0 = L \cap \Theta$, where $L \subset V$ is an affine subspace, given as an image or inverse image of an affine map:

$$\Theta_0 = \{\theta \in \Theta \mid \theta = A\beta + a,\ \beta \in \mathbb{R}^m\}, \qquad \Theta_0 = \{\theta \in \Theta \mid H\theta = \beta_0\}.$$

Smooth hypotheses are given as an image or inverse image of a smooth map for $\Theta$ being an open subset of $\mathbb{R}^k$:

$$\Theta_0 = \{\theta \in \Theta \mid \theta = \phi(\beta),\ \beta \in B \subset \mathbb{R}^m\}, \qquad \Theta_0 = \{\theta \in \Theta \mid h(\theta) = \beta_0\},$$

where the maps $\phi$, $h$ are smooth and have Jacobi matrices with full rank. For exponential families, such hypotheses correspond to curved subfamilies.


7.3 Significance and p-values 


To judge whether a given hypothesis is reasonable, we may (at least in principle) calculate the p-value

$$p = \sup_{\theta \in \Theta_0} P_\theta\{D \geq d(x)\}$$

where $D = d(X)$ is the random variable corresponding to our test statistic and $d(x)$ the observed value. The p-value is the highest probability that can be achieved for the event $\{D \geq d(x)\}$ while maintaining the hypothesis that $\theta \in \Theta_0$. If this is very small, say $p < \epsilon$, we invoke Borel's single law of chance, also sometimes referred to as Cournot's principle:

IMPROBABLE EVENTS DO NOT OCCUR.

We then conclude that the hypothesis cannot be maintained. We also say that there is significant evidence against the hypothesis. With some imprecision, we may also say that the test is significant.

Note that the process of significance testing as described above fits well into Karl 
Popper’s theory of scientific progress through falsification (Popper, 1959), reflecting 
the time period when it was conceived, early in the 20th century while Karl Popper 
(1902-1994) also developed his philosophical theory of scientific evidence. 

It remains to be quantified what 'very small' is. This appears to be mostly culturally defined and often depends on the context. Émile Borel (1943) set the following scales for probabilities to be small when invoking his single law of chance:

* l'échelle humaine (human scale): $\epsilon \approx 10^{-6}$
* l'échelle terrestre (earthly scale): $\epsilon \approx 10^{-15}$
* l'échelle cosmique (cosmic scale): $\epsilon \approx 10^{-50}$.

Modern statistical practice mostly uses $\epsilon \in \{0.05, 0.01, 0.001\}$, and it is common to speak about 'one-, two-, or three-starred significance', but in Particle Physics, for example, an $\epsilon$ between $10^{-6}$ and $10^{-7}$ is used for a scientific discovery to be acknowledged. In general we need different scales in different areas to allow for scientific progress and simultaneously prevent too many false conclusions.

If we have decided in advance what 'very small' means, we speak of a level of significance $\alpha \in (0, 1)$ and we would then reject a hypothesis if $p < \alpha$. We then say that the test is significant at level $\alpha$. Clearly, if $\alpha_1 > \alpha$, any test that is significant at level $\alpha$ is also significant at level $\alpha_1$.

We emphasize that statistical significance is not the same as importance. Consider 
the following simple example. 


Example 7.1. [Male births] In the year 1998 there were a total of 66170 live births in 
Denmark, of which 34055 were boys and 32115 girls. We wish to investigate whether 
the probability of a random child being born as a boy is the same as that of being born 
as a girl. 

We let $X$ denote a random variable corresponding to the number of male births
and specify a corresponding statistical model $\mathcal{P} = \{P_\theta \mid \theta \in \Theta = (0,1)\}$ where $P_\theta$ is
the binomial distribution of length $n = 66170$ with parameter $\theta$. Our null hypothesis
is $H_0 : \theta = 1/2$. Under the hypothesis, we should expect around $66170/2 = 33085$
male births. We choose to use the test statistic


$$d(x) = |x - 33085|$$


measuring the deviation from this expectation and calculate the p-value approxim- 
ately using the central limit theorem yielding a normal approximation to the binomial 
distribution 


$$p = P_{1/2}\{D \geq |34055 - 33085| = 970\} \approx 2\left(1 - \Phi\left(\frac{970}{\sqrt{66170/4}}\right)\right) = 4.64 \times 10^{-14}$$


so this is extremely significant, even at a level close to Borel’s earthly scale and
certainly at a level corresponding to what is used in modern Particle Physics. Never- 
theless, the frequency of male births is 


x/n = 34055/66170 = 0.514659... 


so for most practical purposes, this is a negligible deviation from the equiprobable and
therefore typically unimportant; see also Example 8.2 for a similar situation.
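
The normal approximation used in Example 7.1 can be checked numerically. The following is a minimal sketch using only the Python standard library; the variable names are ours and chosen for illustration only.

```python
import math

n = 66170          # total live births in Denmark, 1998
boys = 34055       # observed number of male births
theta0 = 0.5       # value of theta under the null hypothesis

# Observed deviation d(x) = |x - n/2| from the expected count under H0.
d_obs = abs(boys - n * theta0)

# Standardise with the binomial standard deviation sqrt(n/4) under H0.
z = d_obs / math.sqrt(n * theta0 * (1 - theta0))

# Two-sided normal tail: 2 * (1 - Phi(z)) = erfc(z / sqrt(2)).
# erfc avoids the cancellation that 1 - Phi(z) suffers this far out in the tail.
p_value = math.erfc(z / math.sqrt(2))

freq = boys / n    # observed frequency of male births

print(f"d(x) = {d_obs:.0f}, z = {z:.3f}, p = {p_value:.3g}, x/n = {freq:.6f}")
```

The tiny p-value and the frequency close to 1/2 appear side by side, which is exactly the contrast between significance and importance that the example is making.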


The fact that a statistical test is significant just means that it would not be reas- 
onable to attribute the observed deviation to chance, however small and unimportant 
the deviation may be. 


7.4 Critical regions, power, and error types 


An alternative formulation of hypothesis testing is to see this as a decision problem 
and an associated partition of the representation space into two regions: 


* A critical region $K \subseteq \mathcal{X}$.
* An acceptance region $A = \mathcal{X} \setminus K$.


The interpretation is then that the hypothesis is rejected if $x \in K$ and accepted
if $x \in A$. Again we can without loss of generality assume that these regions are
determined by a test statistic $d$ as before with


$$K = \{x : d(x) > d_{\mathrm{crit}}\}, \quad A = \{x : d(x) \leq d_{\mathrm{crit}}\}$$


where $d_{\mathrm{crit}}$ is the critical value for the test. Thus we can identify the test with its
critical region, or with the pair $(d, d_{\mathrm{crit}})$ of the test statistic and its associated critical
value.

To further investigate the behaviour of such a test, we introduce the power function
$\gamma_K : \Theta \to [0, 1]$:


$$\gamma_K(\theta) = P_\theta\{K\} = P_\theta\{D > d_{\mathrm{crit}}\} = 1 - P_\theta\{A\}$$


giving the rejection probability as a function of the unknown parameter $\theta$.
When accepting or rejecting a hypothesis, we may commit two types of error. We
say that we commit an error of type I if we reject a hypothesis that is true, i.e. if


$$x \in K \text{ and } \theta \in \Theta_0.$$


Similarly, we say that we commit an error of type II if we accept a hypothesis that is
false, i.e. if

$$x \in A \text{ and } \theta \notin \Theta_0.$$


The power function determines the probability of committing these errors although
they in general depend on the unknown value of $\theta$; indeed, the probability of a type I
error is


$$\gamma_K(\theta) = P_\theta\{K\}, \quad \theta \in \Theta_0$$

whereas the probability of a type II error is

$$\beta_K(\theta) = 1 - \gamma_K(\theta), \quad \theta \in \Theta \setminus \Theta_0.$$


We would ideally like both of these errors to be small. The size $\delta_K$ of the test is the
least upper bound for the probability of a type I error:


$$\delta_K = \sup_{\theta \in \Theta_0} \gamma_K(\theta) = \sup_{\theta \in \Theta_0} P_\theta\{K\}.$$

It might be difficult to determine the exact size of a test so we also say that the test
has level $\alpha \in [0,1]$ if its size is at most $\alpha$:


$$\delta_K \leq \alpha.$$


When constructing a test, we shall attempt to maximize its power outside the hypo- 
thesis, i.e. minimize the type II error probability, while ensuring a given level a, i.e. 
controlling the probability of a type I error. 
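
To make the power function and the two error probabilities concrete, here is a small sketch for a binomial test. The design (100 trials, hypothesis $\theta = 1/2$, rejection when the count deviates from 50 by more than 10) is a hypothetical choice of ours, not taken from the text.

```python
from math import comb

def binom_pmf(x, n, theta):
    """Point probability of x successes in n trials with success probability theta."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def power(theta, n, center, d_crit):
    """Rejection probability P_theta{ |X - center| > d_crit } for X ~ binom(n, theta)."""
    return sum(binom_pmf(x, n, theta)
               for x in range(n + 1)
               if abs(x - center) > d_crit)

# Hypothetical design: n = 100 trials, H0: theta = 1/2, reject when |x - 50| > 10.
n, center, d_crit = 100, 50, 10

# For a simple hypothesis the size is just the power evaluated at theta = 1/2.
size = power(0.5, n, center, d_crit)
print(f"size = {size:.4f}")

for theta in (0.55, 0.6, 0.7):
    gamma = power(theta, n, center, d_crit)
    # Outside the hypothesis, 1 - power is the type II error probability.
    print(f"theta = {theta:.2f}: power = {gamma:.4f}, type II error = {1 - gamma:.4f}")
```

The type II error probability shrinks as $\theta$ moves away from 1/2, while the size stays fixed by the choice of critical value.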

We emphasize that power considerations are only relevant in the planning phase, 
when constructing tests and test statistics, or deciding on the sample size for an ex- 
periment yet to be conducted. Once a test has been designed and the test statistic 
calculated, it is formally the associated p-value that carries the relevant information. 
However, it is worth knowing about the power of the test, simply because a test with 
low power may not be able to falsify a hypothesis and therefore provides only weak 
evidence. 


Example 7.2. [Multiple choice examinations] We consider the design of a set of 
exam questions for a multiple choice examination. The idea is to have three choices
for each of the $n$ questions in the exam. So a student who knows about 50% of the material
for the course would then give correct answers to half of the questions and guess the 
rest, thus having 1/2 + 1/6 = 2/3 of the n questions right in expectation. Motivated 
by this, we shall demand that 65% of the questions are correctly answered for the 
student to pass the exam. 

Now we are interested in designing the examination in such a way that students 
who have not followed the course at all and simply guess all answers will fail with 
high probability. So how many questions do we need to achieve that such students 
will fail with probability .999? 

We may see the examination as a statistical test based on observing the number
of correct answers $X$ taking values in the representation space $\mathcal{X} = \{0, \ldots, n\}$ with
critical region

$$K_n = \{x \in \mathcal{X} \mid x < 65n/100\}.$$


We also assume that $X \sim \mathrm{binom}(n, \theta)$, where $\theta \in \Theta = [1/3, 1]$ and wish to determine
$n$ such that the power


$$\gamma_n(\theta) = P_\theta(K_n)$$

is at least .999 at $\theta = 1/3$. We have

Figure 7.1: The power of a multiple choice examination with three choices in each
question as a function of the total number of questions. The horizontal line is at .999.


where $F_{n,1/3}$ is the cumulative distribution function for a binomial distribution of
length $n$ and success probability 1/3. This function is plotted in Figure 7.1, and we
conclude that n = 20 questions are necessary for us to achieve this goal. Then the 
student must answer x = 13 questions right to pass the exam. Note that the power 
decreases from n = 20 to n = 21 as the pass boundary remains at x = 13. The 
discreteness of the space is the reason that the power is not a monotone function of 
the number of questions n. 


7.5 Set estimation and testing 


There is a fundamental logical relation between set estimation and testing: Given
acceptance regions $A(\theta)$ for tests of simple hypotheses of the form $\Theta_0 = \{\theta\}$, the
associated confidence set is the set of $\theta$-values for which the hypothesis would be
accepted, i.e.


$$C(x) = \{\theta \mid x \in A(\theta)\}.$$


If the tests all have level $\alpha$, we get for the coverage

$$\delta_C = P_\theta\{C(X) \ni \theta\} = P_\theta\{A(\theta)\} = 1 - P_\theta\{K(\theta)\} \geq 1 - \alpha$$


so $C(X)$ is a $(1 - \alpha)$-confidence set for $\theta$. Similarly, if $C = C(x)$ is a confidence
set with coverage at least $1 - \alpha$, the critical region for its associated test of the simple
hypothesis $\Theta_0 = \{\theta\}$ is simply the set of observations for which the confidence set
does not include $\theta$:

$$x \in K_C \iff \theta \notin C(x).$$

Then, if $C(X)$ is a $(1 - \alpha)$-confidence set for $\theta$, $K_C$ becomes the critical region of a
test of level $\alpha$ for the simple hypothesis because then


$$P_\theta\{K_C\} = 1 - P_\theta\{C(X) \ni \theta\} = 1 - \delta_C \leq 1 - (1 - \alpha) = \alpha.$$


Example 7.3. [Continuation of Example 6.13] In the model with fixed coefficient of
variation, i.e. where individual observations were assumed distributed as $N(\beta, \beta^2)$
with $\beta > 0$ unknown and $n = 10$ observations had been seen with $t_n = (0.5, 3)^\top$, we
found three approximate 95% confidence intervals for $\beta$, depending on whether we
used the likelihood ratio statistic $\Lambda_n$, the Wald statistic $W_n$ with the model variance,
or the Wald statistic $\tilde{W}_n$ with the estimated variance:


$$C_\Lambda = (1.05, 2.41), \quad C_W = (1.10, 2.34), \quad C_{\tilde{W}} = (0.96, 2.04).$$


Suppose we were interested in the hypothesis $H_0 : \beta = 1$. This hypothesis would
be accepted at a 5% level with $\tilde{W}_n$ as the test statistic, but rejected by the other test
statistics at that level, since $1 \notin C_\Lambda$ and $1 \notin C_W$, but $1 \in C_{\tilde{W}}$.


Note that we cannot obtain the p-value from the confidence interval, and we can- 
not obtain the confidence interval from the observed p-value. To get the p-value from 
the confidence intervals, we need the full system of intervals for different degrees a 
of confidence. Similarly, to get the confidence intervals from a system of p-values, 
we need to consider the entire family of tests for hypotheses of the form $H_0 : \beta = \beta_0$
for $\beta_0 \in \mathbb{R}_+$.
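
The duality between tests and confidence sets can be illustrated by inverting a family of tests numerically. The sketch below assumes a normal sample with known standard deviation and uses the two-sided Z-test; the data, the grid, and all function names are hypothetical choices of ours. The set of grid values that are not rejected at level $\alpha$ should agree, up to grid resolution, with the familiar closed-form interval.

```python
from math import sqrt
from statistics import NormalDist, mean

std_normal = NormalDist()

def p_value(xs, mu0, sigma):
    """Two-sided Z-test p-value for H0: mu = mu0 with known standard deviation sigma."""
    z = (mean(xs) - mu0) / (sigma / sqrt(len(xs)))
    return 2 * (1 - std_normal.cdf(abs(z)))

def confidence_set(xs, sigma, alpha, grid):
    """All grid values mu0 for which H0: mu = mu0 is not rejected (p >= alpha)."""
    return [mu0 for mu0 in grid if p_value(xs, mu0, sigma) >= alpha]

# Hypothetical data and settings, chosen only for illustration.
xs = [4.9, 5.6, 5.1, 4.4, 5.3, 5.8, 4.7, 5.2]
sigma, alpha = 0.5, 0.05
grid = [4 + k / 1000 for k in range(2001)]      # candidate values 4.000, ..., 6.000

accepted = confidence_set(xs, sigma, alpha, grid)

# Closed-form interval: mean +/- z_{1 - alpha/2} * sigma / sqrt(n).
half_width = std_normal.inv_cdf(1 - alpha / 2) * sigma / sqrt(len(xs))
lower, upper = mean(xs) - half_width, mean(xs) + half_width

print(f"inverted tests: [{min(accepted):.3f}, {max(accepted):.3f}]")
print(f"closed form:    [{lower:.3f}, {upper:.3f}]")
```

Note that the inversion needs the p-value for every candidate value of the parameter, which is the point made above: a single p-value or a single interval does not determine the other.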


7.6 Test in linear normal models 
7.6.1 The general case 


We consider a linear normal model with $X \sim N_V(\xi, \sigma^2 I_V)$ where $(V, \langle \cdot, \cdot \rangle)$ is a
$d$-dimensional Euclidean vector space. Different tests appear as we vary specifications
of restrictions on $\xi$ and $\sigma^2$. Below we shall derive likelihood ratio test statistics for
some of these.


7.6.1.1 Linear hypothesis, variance known 


We first consider the situation where $\sigma^2$ is known and $\xi \in V$ is completely unknown.
We wish to test the hypothesis $H_0 : \xi \in L$, where $L \subseteq V$ is an $m$-dimensional linear
subspace of $V$. The MLE of $\xi$ in the unrestricted model is $\hat{\xi} = X$ and the maximized
log-likelihood is

$$\ell(\hat{\xi}) = -\frac{\|X - \hat{\xi}\|^2}{2\sigma^2} = 0$$

where we have ignored additive constant terms involving $2\pi$ and $\sigma^2$. Similarly, under
the hypothesis, the MLE is $\hat{\hat{\xi}} = \Pi_L(X)$ leading to the maximized log-likelihood

$$\ell(\hat{\hat{\xi}}) = -\frac{\|X - \hat{\hat{\xi}}\|^2}{2\sigma^2} = -\frac{\|X - \Pi_L(X)\|^2}{2\sigma^2}.$$

We thus get for the likelihood ratio test statistic

$$\Lambda(X) = 2\left(\ell(\hat{\xi}) - \ell(\hat{\hat{\xi}})\right) = \frac{\|X - \Pi_L(X)\|^2}{\sigma^2}.$$

From Theorem 2.24, we get that $\Lambda(X)$ follows a $\chi^2(d - m)$ distribution leading to
the p-value

$$p = P\{\Lambda(X) \geq \Lambda(x)\} = 1 - F^{d-m}(\Lambda(x))$$

where $F^{d-m}$ is the distribution function for the $\chi^2(d - m)$ distribution.


7.6.1.2 Linear subhypothesis, variance unknown 


A more common model has both $\sigma^2 > 0$ and $\xi$ unknown, assuming $\xi \in L$, where
$L$ is an $m$-dimensional linear subspace of $V$. We then wish to test the hypothesis
$H_1 : \xi \in L_1$, where $L_1 \subseteq V$ is a $k$-dimensional linear subspace of $L$. The MLE of
$\xi$ in the larger model is $\hat{\xi} = \Pi_L(X)$ leading to the profile log-likelihood, maximized
over $\xi \in L$

$$\ell(\hat{\xi}, \sigma^2) = -\frac{d}{2} \log \sigma^2 - \frac{\|X - \hat{\xi}\|^2}{2\sigma^2} = -\frac{d}{2} \log \sigma^2 - \frac{\|X - \Pi_L(X)\|^2}{2\sigma^2}$$

where we have ignored additive constants involving $\pi$, but this time we need the
terms involving $\sigma^2$ as these are not constant. Maximizing again over $\sigma^2$ yields

$$\hat{\sigma}^2 = \|X - \Pi_L(X)\|^2 / d$$

leading to the full maximized log-likelihood

$$\ell(\hat{\xi}, \hat{\sigma}^2) = -\frac{d}{2} \log \hat{\sigma}^2 - \frac{\|X - \Pi_L(X)\|^2}{2\hat{\sigma}^2} = -\frac{d}{2} \log \frac{\|X - \Pi_L(X)\|^2}{d} - \frac{d}{2}.$$

Similarly, the maximized likelihood under the hypothesis $H_1$ becomes

$$\ell(\hat{\hat{\xi}}, \hat{\hat{\sigma}}^2) = -\frac{d}{2} \log \hat{\hat{\sigma}}^2 - \frac{\|X - \Pi_{L_1}(X)\|^2}{2\hat{\hat{\sigma}}^2} = -\frac{d}{2} \log \frac{\|X - \Pi_{L_1}(X)\|^2}{d} - \frac{d}{2}$$

leading to the likelihood ratio test statistic

$$\Lambda(X) = 2\ell(\hat{\xi}, \hat{\sigma}^2) - 2\ell(\hat{\hat{\xi}}, \hat{\hat{\sigma}}^2) = d \log \frac{\|X - \Pi_{L_1}(X)\|^2}{\|X - \Pi_L(X)\|^2}.$$

Since we have $\|X - \Pi_{L_1}(X)\|^2 = \|X - \Pi_L(X)\|^2 + \|\Pi_L(X) - \Pi_{L_1}(X)\|^2$, we
may rewrite this as

$$\Lambda(X) = d \log \left(1 + \frac{\|\Pi_L(X) - \Pi_{L_1}(X)\|^2}{\|X - \Pi_L(X)\|^2}\right)$$

and thus we have $\Lambda(X) \geq \Lambda(x)$ if and only if $F(X) \geq f(x)$ where $F$ is the normalized
ratio of squared distances

$$F = F(X) = \frac{\|\hat{\xi} - \hat{\hat{\xi}}\|^2/(m - k)}{\|X - \Pi_L(X)\|^2/(d - m)} = \frac{\|\Pi_L(X) - \Pi_{L_1}(X)\|^2/(m - k)}{\|X - \Pi_L(X)\|^2/(d - m)}.$$

In other words, large values of the squared distance between the estimates, as compared
to the squared length of the residual, indicate deviations from the hypothesis.

By Theorem 2.27, $\|\Pi_L(X) - \Pi_{L_1}(X)\|^2/\sigma^2$ and $\|X - \Pi_L(X)\|^2/\sigma^2$ are
independent and $\chi^2$-distributed with $m - k$ and $d - m$ degrees of freedom under the
hypothesis, so $F$ then follows an $F_{m-k,d-m}$ distribution. We may therefore calculate
the p-value for the test as


$$p = P\{\Lambda(X) \geq \Lambda(x)\} = P\{F(X) \geq f(x)\} = 1 - F_{m-k,d-m}(f(x))$$


where $F_{m-k,d-m}$ is the distribution function for an $F$ distribution with the relevant
degrees of freedom.
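
As a numerical illustration of this derivation, the sketch below takes a one-way layout, where the projection onto $L$ replaces each observation by its group mean and the projection onto $L_1$ replaces it by the grand mean. The data are hypothetical and the variable names are ours. The code computes the squared distances, the $F$ statistic, and the likelihood ratio statistic, so that the Pythagorean identity and the monotone relation between the two statistics can be checked directly.

```python
from math import log

# Hypothetical observations in three groups (a one-way layout).
groups = [[5.1, 4.8, 5.6, 5.0],
          [6.2, 5.9, 6.4],
          [4.7, 5.3, 5.0, 4.9, 5.2]]

x = [v for g in groups for v in g]
d = len(x)            # dimension of V
m = len(groups)       # dim L: one mean parameter per group
k = 1                 # dim L_1: a common mean for all groups

# Projection onto L: each observation is replaced by its group mean.
proj_L = [sum(g) / len(g) for g in groups for _ in g]
# Projection onto L_1: each observation is replaced by the grand mean.
grand = sum(x) / d
proj_L1 = [grand] * d

def sqdist(u, v):
    """Squared Euclidean distance between two vectors given as lists."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

rss_L = sqdist(x, proj_L)          # ||X - Pi_L X||^2
rss_L1 = sqdist(x, proj_L1)        # ||X - Pi_L1 X||^2
between = sqdist(proj_L, proj_L1)  # ||Pi_L X - Pi_L1 X||^2

F = (between / (m - k)) / (rss_L / (d - m))
Lambda = d * log(rss_L1 / rss_L)

print(f"F = {F:.3f}, Lambda = {Lambda:.3f}")
```

Since the likelihood ratio statistic is an increasing function of $F$, the two statistics define the same system of critical regions.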


7.6.2 Some standard tests 


We shall here first mention a number of standard tests of which most may be seen
as special instances of those derived in Section 7.6.1. These are then all equivalent
to likelihood ratio tests in the sense that, for each of them, it holds that there is a
function $h$ such that

$$\Lambda(x) > \lambda_0 \iff d(x) > h(\lambda_0)$$

where $\Lambda(x)$ is the likelihood ratio statistic as defined in (7.1) and $d(x)$ is the test statistic
in the specific example discussed. In other words, the systems of critical regions
defined by $\Lambda$ and $D$ are identical.

All of these standard tests are of constant level, meaning that the distribution of 
the test statistic is the same for all parameter values conforming with the hypothesis. 
In other words the test statistic is a pivot under the hypothesis, see also the next 
section for further discussion. 


7.6.2.1 Z-test for a given mean, variance known 


We consider a sample $X = (X_1, \ldots, X_n)$ from a normal distribution $N(\mu, \sigma^2)$
where $\sigma^2 > 0$ is known and $\mu \in \mathbb{R}$ is unknown. The null hypothesis is $H_0 : \mu = \mu_0$
where $\mu_0$ is a specific value. The test statistic is $d(x) = |z(x)|$ where


$$Z = z(X) = \frac{\bar{X}_n - \mu_0}{\sigma/\sqrt{n}}$$

which for all $\mu_0$ follows a standard $N(0, 1)$ distribution so


$$p(x) = P_{\mu_0}\{|Z| \geq |z(x)|\} = 2(1 - \Phi(|z(x)|))$$


where $\Phi$ is the standard normal distribution function. This is an instance of the situation
in Section 7.6.1.1 applied to $Y = X - \mu_0 \mathbf{1}$.


7.6.2.2 T-test for a given mean, variance unknown 


We consider a sample $X = (X_1, \ldots, X_n)$ from a normal distribution $N(\mu, \sigma^2)$
where $\sigma^2 > 0$ and $\mu \in \mathbb{R}$ are both unknown. The null hypothesis is composite and
given as $H_0 : \mu = \mu_0$ where $\mu_0$ is a specific value, whereas $\sigma^2 > 0$ is also considered
unknown under the hypothesis.

When the variance $\sigma^2$ is not known, the value of $Z$ cannot be calculated and we
use instead the test statistic $d(X) = T = |t(X)|$ where

$$t(X) = \frac{\bar{X}_n - \mu_0}{S/\sqrt{n}}$$

where

$$S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2.$$

The distribution of $t(X)$ is for all $(\mu_0, \sigma^2)$ a Student’s t-distribution so

$$p = p(x) = P_{\mu_0, \sigma^2}\{T \geq |t(x)|\} = 2(1 - F^T_{n-1}(|t(x)|))$$


where $F^T_{n-1}$ is the distribution function for Student’s T with $n - 1$ degrees of freedom.
This is again a special instance of the situation in Section 7.6.1.2 applied to
$Y = X - \mu_0 \mathbf{1}$. Then $L = \mathrm{span}\{\mathbf{1}\}$ and $L_1 = \{0\}$ and we have $T^2 = F \sim F_{1,n-1}$.
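
The one-sample T-test is easy to carry out by hand. The following sketch uses only the Python standard library; because that library has no Student's t distribution function, the sketch integrates the density numerically with Simpson's rule. The helper names `t_pdf`, `t_cdf` and `t_test` and the data are our own assumptions, and a statistics package would normally be used instead.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """Distribution function via Simpson's rule on [0, |t|] and symmetry (steps even)."""
    a = abs(t)
    h = a / steps
    s = t_pdf(0, df) + t_pdf(a, df)
    s += sum((4 if i % 2 else 2) * t_pdf(i * h, df) for i in range(1, steps))
    half = s * h / 3                  # integral of the density over [0, |t|]
    return 0.5 + half if t >= 0 else 0.5 - half

def t_test(xs, mu0):
    """Return (t, p) for the two-sided one-sample T-test of H0: mu = mu0."""
    n = len(xs)
    t = (mean(xs) - mu0) / (stdev(xs) / sqrt(n))   # stdev uses the n - 1 divisor
    p = 2 * (1 - t_cdf(abs(t), n - 1))
    return t, p

# Hypothetical sample, chosen only for illustration.
xs = [5.1, 4.9, 5.6, 5.8, 5.3, 4.7, 5.5, 5.4]
t, p = t_test(xs, mu0=5.0)
print(f"t = {t:.3f}, p = {p:.4f}")
```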


7.6.2.3 T-test for comparing means 


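We consider two independent samples $X = (X_1, \ldots, X_m)$ and $Y = (Y_1, \ldots, Y_n)$
where $X_i \sim N(\mu^X, \sigma^2)$ and $Y_j \sim N(\mu^Y, \sigma^2)$ are all mutually independent.
We consider $(\mu^X, \mu^Y, \sigma^2) \in \mathbb{R} \times \mathbb{R} \times \mathbb{R}_+$ to be unknown and we are interested
in the composite hypothesis $H_0 : \mu^X = \mu^Y$, whereas $\sigma^2$ is considered unknown.
Then we use the test statistic $d(X, Y) = |t(X, Y)|$:

$$t(X, Y) = \frac{\bar{X}_m - \bar{Y}_n}{S\sqrt{1/m + 1/n}}$$

where

$$S^2 = \frac{\sum_{i=1}^m (X_i - \bar{X}_m)^2 + \sum_{j=1}^n (Y_j - \bar{Y}_n)^2}{m + n - 2}.$$


This test statistic follows for all values of $\mu = \mu^X = \mu^Y$ a Student’s t-distribution
with $m + n - 2$ degrees of freedom, so again the relevant p-value is


$$p = p(x, y) = 2(1 - F^T_{m+n-2}(|t(x, y)|)).$$


This is again a special instance of the situation in Section 7.6.1.2. Here $V = \mathbb{R}^{m+n}$,
$L = \mathrm{span}\{(\mathbf{1}, \mathbf{0}), (\mathbf{0}, \mathbf{1})\}$ and $L_1 = \mathrm{span}\{(\mathbf{1}, \mathbf{1})\}$. Since $\dim L - \dim L_1 = 1$, we
get $T^2 = F$.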

It is important that the variances for the two samples are assumed identical. If 
this is not the case, no test of constant level exists and other ad hoc or approximate 
methods must be used. This situation is known as the Behrens—Fisher problem. 
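
The identity between the squared two-sample t statistic and the $F$ statistic of Section 7.6.1.2 can be checked numerically. The sketch below computes both from hypothetical data; the function names are ours.

```python
from math import sqrt
from statistics import mean

def two_sample_t(xs, ys):
    """Pooled-variance t statistic for H0: equal means, common unknown variance."""
    m, n = len(xs), len(ys)
    xbar, ybar = mean(xs), mean(ys)
    ss = sum((x - xbar) ** 2 for x in xs) + sum((y - ybar) ** 2 for y in ys)
    s2 = ss / (m + n - 2)                       # pooled variance estimate S^2
    return (xbar - ybar) / sqrt(s2 * (1 / m + 1 / n))

def f_statistic(xs, ys):
    """F statistic with L = separate group means and L_1 = one common mean."""
    m, n = len(xs), len(ys)
    xbar, ybar = mean(xs), mean(ys)
    grand = mean(xs + ys)
    # Squared distance between the two projections (dim L - dim L_1 = 1).
    between = m * (xbar - grand) ** 2 + n * (ybar - grand) ** 2
    # Squared length of the residual after projecting onto L (d - m = m + n - 2).
    within = sum((x - xbar) ** 2 for x in xs) + sum((y - ybar) ** 2 for y in ys)
    return between / (within / (m + n - 2))

# Hypothetical samples, chosen only for illustration.
xs = [12.1, 11.4, 13.0, 12.6, 11.9]
ys = [10.8, 11.5, 10.2, 11.1, 10.9, 11.3]

t = two_sample_t(xs, ys)
F = f_statistic(xs, ys)
print(f"t = {t:.4f}, t^2 = {t * t:.4f}, F = {F:.4f}")
```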
7.6.2.4 T-test for paired comparisons


Here we consider a paired sample $(X, Y) = (X_1, Y_1, \ldots, X_n, Y_n)$ where all single
observations are independent and normally distributed as

$$X_i \sim N(\mu_i^X, \sigma_X^2), \quad Y_i \sim N(\mu_i^Y, \sigma_Y^2),$$

where $\mu_i^X, \mu_i^Y$, $i = 1, \ldots, n$, and $(\sigma_X^2, \sigma_Y^2)$ are all unknown. We further assume that

$$\mu_i^X = \mu_i^Y + \delta, \qquad (7.2)$$


i.e. that the difference of means in the two groups is the same for all $i$. Note that
this is not an instance of the standard test in linear normal models unless we further
assume $\sigma_X^2 = \sigma_Y^2$. For otherwise the family of concentration matrices is not proportional
to the identity.

But we may transform the data and let $D_i = X_i - Y_i$, noticing that then
$D_i \sim N(\delta, \sigma_X^2 + \sigma_Y^2)$. The problem of unequal variances has now disappeared after
transforming the data to the set of differences. We are interested in the hypothesis
$H_0 : \delta = 0$, corresponding to the situation where the means in the two samples are
pairwise identical. Note that considering the differences rather than the original
observations has implied that the composite hypothesis in the original problem has
become a simple hypothesis in terms of the differences and thus been reduced to the
simple T-test in Section 7.6.2.2, so we use the test statistic $d(X, Y) = |t(X, Y)|$ where


$$t(X, Y) = \frac{\bar{D}_n}{S_D/\sqrt{n}}, \quad \bar{D}_n = \frac{1}{n} \sum_{i=1}^n D_i, \quad S_D^2 = \frac{1}{n-1} \sum_{i=1}^n (D_i - \bar{D}_n)^2,$$


which under the hypothesis follows a Student’s t-distribution with $n - 1$ degrees of
freedom so the p-value becomes


$$p = p(x, y) = 2(1 - F^T_{n-1}(|t(x, y)|)),$$


where $F^T_{n-1}$ is the distribution function of Student’s T with $n - 1$ degrees of freedom.


7.7 Determining p-values 


For any testing procedure to be operational, we need to be able to calculate the
associated p-value and this section is devoted to methods for doing so. The simplest case
is when the distribution of the test statistic $D = d(X)$ is the same for all $\theta \in \Theta_0$ or,
in other words, $d(X)$ is a pivot under the hypothesis. We then say that the test has
constant level, i.e. it holds that


$$\gamma_\theta(d) = P_\theta\{D \geq d\} = \gamma(d), \quad \text{for all } \theta \in \Theta_0 \text{ and all } d \in \mathbb{R}.$$

Then the p-value for an outcome $x$ is just


$$p = p(x) = \gamma(d(x)).$$

As we have seen in Section 7.6, a number of classical tests satisfy this property,
including most tests associated with the general linear model, typically leading to
test statistics following a normal distribution, Student’s T, the $\chi^2$-distribution, or the
F-distribution.

7.7.1 Monte Carlo p-values


Above we have seen a number of examples where we have been able to calculate
the p-values exactly, as the test statistics happened to be equivalent to quantities having
standard distributions with well-known properties. This is not always the case. However,
often we are able to simulate from the distribution of the test statistic $D = d(X)$
and are therefore able to get a Monte Carlo estimate of the p-value.


7.7.1.1 Simple hypotheses 


Consider first a simple hypothesis $H_0 : \theta = \theta_0$ and assume that we have observed
$d = d(x)$ in our original sample. We may then generate a new and artificial sample
$X^* = (X_1^*, \ldots, X_N^*)$ of size $N$ from the distribution $P_{\theta_0}$ of $X$ and then estimate the
p-value by the relevant frequency:


$$\hat{p}_N = \hat{p}_{MC}(d(x)) = \frac{1}{N} \sum_{i=1}^N 1_{[d(x),\infty)}(d(X_i^*)) = \frac{1}{N} \sum_{i=1}^N Y_i$$


where we have let $Y_i = 1_{[d(x),\infty)}(d(X_i^*))$. Then the $Y_i$ are independent Bernoulli random
variables with success probability equal to the p-value $p = p(d(x))$ we are looking
for, so we obtain an approximate $(1 - \alpha)$-confidence interval for $p$ using the Wald
interval

$$\hat{p}_N \pm z_{1-\alpha/2} \times \sqrt{\hat{p}_N(1 - \hat{p}_N)/N}.$$
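
The Monte Carlo recipe above can be sketched in a few lines of Python. The model (a binomial count with 30 trials, hypothesis $\theta = 1/2$, observed count 21), the seed, and the function names are hypothetical choices of ours; the model is deliberately so simple that the exact p-value is available for comparison.

```python
import random
from math import comb, sqrt
from statistics import NormalDist

def mc_p_value(d_obs, simulate_d, N, rng, alpha=0.05):
    """Monte Carlo p-value estimate with an approximate (1 - alpha) Wald interval."""
    # Y_i = 1 if the simulated statistic is at least as large as the observed one.
    hits = sum(simulate_d(rng) >= d_obs for _ in range(N))
    p_hat = hits / N
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z * sqrt(p_hat * (1 - p_hat) / N)
    return p_hat, (p_hat - half, p_hat + half)

# Hypothetical example: X ~ binom(30, theta), H0: theta = 1/2, d(x) = |x - 15|, x = 21.
n, theta0, x_obs = 30, 0.5, 21
d_obs = abs(x_obs - n * theta0)

def simulate_d(rng):
    """Draw one artificial observation under H0 and return its test statistic."""
    x_star = sum(rng.random() < theta0 for _ in range(n))
    return abs(x_star - n * theta0)

rng = random.Random(2024)            # fixed seed so that the run is reproducible
p_hat, (low, high) = mc_p_value(d_obs, simulate_d, N=20000, rng=rng)

# Exact p-value for comparison (valid because theta0 = 1/2 makes all outcomes
# of the n trials equally likely).
p_exact = sum(comb(n, x) * theta0**n
              for x in range(n + 1) if abs(x - n * theta0) >= d_obs)

print(f"Monte Carlo: {p_hat:.4f} in ({low:.4f}, {high:.4f}); exact: {p_exact:.4f}")
```

The width of the Wald interval shrinks like $1/\sqrt{N}$, so each additional digit of accuracy costs roughly a hundred times as many simulations.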


7.7.1.2 Composite hypotheses 


In the case of a composite hypothesis, we may be lucky and have a test statistic
with constant level so that the distribution of $D$ under the hypothesis is known. If
this is the case, we essentially proceed as above for an arbitrary choice of $\theta_0 \in \Theta_0$.
However, if this is not the case, we may use the following procedure which is known
as parametric bootstrap. First, estimate $\theta$ under the hypothesis, say by maximum
likelihood:


$$\hat{\theta} = \mathrm{argmax}_{\theta \in \Theta_0} \ell_x(\theta).$$


Then proceed as above, just simulating the artificial sample $X^* = (X_1^*, \ldots, X_N^*)$
from the estimated distribution $P_{\hat{\theta}}$.
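
A parametric bootstrap along these lines might look as follows. The example is our own and not from the text: two independent Poisson samples with the composite hypothesis of equal means, tested with the likelihood ratio statistic, whose null distribution depends on the unknown common mean. The data, the seed, and the helper names are all hypothetical.

```python
import random
from math import exp, log

def poisson_draw(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method (fine for small lam)."""
    limit, k, prod = exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def xlog_ratio(a, b):
    """a * log(a / b), with the convention 0 * log 0 = 0."""
    return a * log(a / b) if a > 0 else 0.0

def lr_statistic(xs, ys):
    """Likelihood ratio statistic for H0: equal Poisson means in the two samples."""
    sx, sy = sum(xs), sum(ys)
    pooled = (sx + sy) / (len(xs) + len(ys))     # MLE of the common mean under H0
    if pooled == 0:
        return 0.0
    return 2 * (xlog_ratio(sx, len(xs) * pooled) + xlog_ratio(sy, len(ys) * pooled))

def bootstrap_p_value(xs, ys, N, rng):
    """Parametric bootstrap p-value: simulate from the model fitted under H0."""
    d_obs = lr_statistic(xs, ys)
    lam_hat = (sum(xs) + sum(ys)) / (len(xs) + len(ys))
    hits = 0
    for _ in range(N):
        xs_star = [poisson_draw(lam_hat, rng) for _ in xs]
        ys_star = [poisson_draw(lam_hat, rng) for _ in ys]
        hits += lr_statistic(xs_star, ys_star) >= d_obs
    return hits / N

# Hypothetical counts, chosen only for illustration.
xs = [3, 1, 4, 2, 2, 5, 3]
ys = [6, 4, 7, 5, 3, 6]
p_hat = bootstrap_p_value(xs, ys, N=5000, rng=random.Random(1))
print(f"LR statistic = {lr_statistic(xs, ys):.3f}, bootstrap p-value = {p_hat:.3f}")
```

The only change from the simple-hypothesis case is that the simulation uses the estimated parameter, so the resulting p-value is itself an approximation even for very large $N$.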


7.7.2 Asymptotic p-values 


As an alternative to exact and Monte Carlo methods, we may use the results derived 
in Chapter 5 to enable us to calculate approximate p-values in a wide range of prob- 
lems. 


7.7.2.1 Simple hypotheses 


For the case of a simple hypothesis in a model given by a curved exponential fam- 
ily, Theorem 5.35 yields exactly what we need for the approximate calculation of 
p-values for the likelihood ratio and Wald type statistics. 
Example 7.4. [Continuation of Example 6.13] In the model with fixed coefficient of
variation, i.e. where individual observations were assumed distributed as $N(\beta, \beta^2)$
with $\beta > 0$ unknown and $n = 10$ observations had been seen with $t_n = (0.5, 3)^\top$,
we calculated the log-likelihood ratio statistic in (6.5) to


$$\Lambda_n(\beta) = \frac{30}{\beta^2} - \frac{10}{\beta} + 20 \log \beta - 14.776,$$


so the value of the test statistic for the hypothesis $H_0 : \beta = 1$ is


$$\Lambda_n(1) = 30 - 10 + 0 - 14.776 = 5.224.$$

We thus get the approximate p-value from a $\chi^2(1)$ distribution as

$$p = P\{\Lambda_n(1) \geq 5.22\} = 0.022$$


corresponding to the fact that the 95% confidence interval based on this statistic did
not contain the value 1, see Example 7.3.
Similarly, using the Wald statistic with model variance we get from (6.6) that


$$W_n(1) = 30/4 = 7.5$$


leading to the asymptotic p-value $p = .006$, yielding the same conclusion. Finally,
we have

$$\tilde{W}_n(\beta) = \frac{3n}{\hat{\beta}_n^2}(\hat{\beta}_n - \beta)^2 = \frac{40}{3}(1.5 - \beta)^2$$

so $\tilde{W}_n(1) = 10/3 = 3.33$, leading to an asymptotic p-value of 0.068, corresponding
to the fact that $\beta = 1$ was included in the last 95% Wald confidence interval.
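
The three asymptotic p-values in Example 7.4 can be reproduced with the standard library alone, using the fact that a $\chi^2(1)$ variable is the square of a standard normal variable. The function names below are ours; the numbers are those of the example.

```python
from math import erfc, log, sqrt

def chi2_1_sf(x):
    """P{chi^2(1) >= x}. If Z is standard normal then Z^2 is chi^2(1), so the
    tail probability equals P{|Z| >= sqrt(x)} = erfc(sqrt(x / 2))."""
    return erfc(sqrt(x / 2))

def lr_stat(beta):
    """Log-likelihood ratio statistic of the example as a function of beta."""
    return 30 / beta**2 - 10 / beta + 20 * log(beta) - 14.776

for name, value in [("likelihood ratio", lr_stat(1.0)),
                    ("Wald, model variance", 30 / 4),
                    ("Wald, estimated variance", 10 / 3)]:
    print(f"{name}: statistic = {value:.3f}, p = {chi2_1_sf(value):.3f}")
```

As a further check, `lr_stat(1.5)` is zero up to the rounding of the constant 14.776, consistent with the statistic vanishing at the maximum likelihood estimate.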


Example 7.5. [Continuation of Example 5.27] We consider again the curved family,
where the mean of a bivariate normal distribution is assumed to be located on a
semi-circle in the right half-plane and recall from Example 5.27 that the MLE based on $n$
observations is

$$\hat{\beta}_n = \tan^{-1}(\bar{x}_{2n}/\bar{x}_{1n})$$


provided $\bar{x}_{1n} > 0$. From (3.10) the maximized log-likelihood function based on $n$
observations then becomes


$$\ell_n(\hat{\beta}_n) = n\phi(\hat{\beta}_n)^\top \bar{x}_n = n(\bar{x}_{1n}/R, \bar{x}_{2n}/R)^\top \bar{x}_n = n\|\bar{x}_n\|^2/R = nR,$$


where $R = \|\bar{x}_n\| = \sqrt{\bar{x}_{1n}^2 + \bar{x}_{2n}^2}$.
Suppose now that we wish to test the hypothesis $H_0 : \beta = 0$. The log likelihood
ratio statistic becomes


$$\Lambda_n = 2(\ell_n(\hat{\beta}_n) - \ell_n(0)) = 2nR - 2n\bar{x}_{1n} = 2n(R - \bar{x}_{1n}).$$


Thus this test statistic has an asymptotic $\chi^2(1)$-distribution and compares the length
of the observation to the length of the first coordinate (recall that we assume $\bar{x}_{1n} > 0$
to ensure existence of the MLE).
7.7.2.2 Composite hypotheses


Example 7.6. [Continuation of Example 5.30 and Example 7.5] We may wish to 
make a statistical test for the hypothesis that the mean is actually on the given semi- 
circle. 

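In the larger exponential family we assume no restrictions on the mean $\theta \in \mathbb{R}^2$,
so we have the usual MLE $\hat{\theta}_n = \bar{x} = (\bar{x}_1, \bar{x}_2)^\top$ with the maximized log-likelihood
function being


$$2\ell_n(\hat{\theta}_n) = -\sum_{i=1}^n (x_{i1} - \bar{x}_1)^2 - \sum_{i=1}^n (x_{i2} - \bar{x}_2)^2,$$


ignoring terms that are constant in $\theta$. If we estimate in the model determined by


$$\theta = \phi(\beta) = (\cos \beta, \sin \beta)^\top, \quad \beta \in (-\pi/2, \pi/2)$$


we have seen that

$$\hat{\beta}_n = \tan^{-1}(\bar{x}_2/\bar{x}_1),$$


provided that $\bar{x}_1 > 0$. Thus, the log-likelihood function maximized under the
hypothesis $H_0 : P \in \mathcal{P}_0$ is


$$\begin{aligned}
2\ell_n(\hat{\beta}_n) &= -\sum_{i=1}^n (x_{i1} - \cos \hat{\beta}_n)^2 - \sum_{i=1}^n (x_{i2} - \sin \hat{\beta}_n)^2 \\
&= -\sum_{i=1}^n (x_{i1} - \bar{x}_1)^2 - \sum_{i=1}^n (x_{i2} - \bar{x}_2)^2 - n(\bar{x}_1 - \cos \hat{\beta}_n)^2 - n(\bar{x}_2 - \sin \hat{\beta}_n)^2 \\
&= 2\ell_n(\hat{\theta}_n) - n(\bar{x}_1 - \cos \hat{\beta}_n)^2 - n(\bar{x}_2 - \sin \hat{\beta}_n)^2.
\end{aligned}$$


Using now the expressions for $\cos \hat{\beta}_n$ and $\sin \hat{\beta}_n$ in (5.17) from Example 5.27 we get if
$\bar{x}_1 > 0$


$$\begin{aligned}
\Lambda'_n = 2(\ell_n(\hat{\theta}_n) - \ell_n(\hat{\beta}_n)) &= n(\bar{x}_1 - \cos \hat{\beta}_n)^2 + n(\bar{x}_2 - \sin \hat{\beta}_n)^2 \\
&= n\left(\bar{x}_1 - \frac{\bar{x}_1}{R}\right)^2 + n\left(\bar{x}_2 - \frac{\bar{x}_2}{R}\right)^2 = nR^2\left(1 - \frac{1}{R}\right)^2 = n(1 - R)^2
\end{aligned}$$


and this is asymptotically distributed as $\chi^2(2 - 1) = \chi^2(1)$, i.e. with 1 degree of
freedom. The test statistic $\Lambda'_n$ measures how much the length $R$ of the observed
average differs from 1. Note however also that $\Lambda'_n$ only makes proper sense if $\bar{x}_1 > 0$
since otherwise the MLE under the hypothesis is not well-defined.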

If we consider the simple hypothesis $H_0 : \theta = \theta_0 = (1, 0)^\top$ without assuming
the semi-circle model, we would get the test statistic


$$\Lambda''_n = 2(\ell_n(\hat{\theta}_n) - \ell_n(\theta_0)) = n\|\hat{\theta}_n - \theta_0\|^2$$

and here the asymptotic $\chi^2(2)$-distribution would actually be exact. However, we
also have


$$\Lambda''_n = 2(\ell_n(\hat{\theta}_n) - \ell_n(\hat{\beta}_n)) + 2(\ell_n(\hat{\beta}_n) - \ell_n(\beta_0)) = \Lambda'_n + \Lambda_n$$


where $\Lambda_n$ was calculated in Example 7.5. This may also be verified by the calculation


$$\begin{aligned}
\Lambda'_n + \Lambda_n &= n(1 - R)^2 + 2n(R - \bar{x}_{1n}) \\
&= n(1 + R^2 - 2R + 2R - 2\bar{x}_{1n}) \\
&= n(1 + \bar{x}_{1n}^2 + \bar{x}_{2n}^2 - 2\bar{x}_{1n}) \\
&= n((\bar{x}_{1n} - 1)^2 + \bar{x}_{2n}^2) = n\|\hat{\theta}_n - \theta_0\|^2 = \Lambda''_n.
\end{aligned}$$


Whereas $\Lambda''_n$ has an exact $\chi^2(2)$-distribution, the individual constituents $\Lambda_n$ and $\Lambda'_n$
are only asymptotically distributed as $\chi^2(1)$.
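
The additive decomposition of the likelihood ratio statistics holds for any data set, which makes it easy to check by simulation. The sketch below draws hypothetical bivariate normal data with identity covariance and mean $(1, 0)$; the seed, the sample size and the variable names are our own choices.

```python
import random
from math import hypot

rng = random.Random(7)
n = 50

# Hypothetical data: independent N(1, 1) first and N(0, 1) second coordinates.
xs = [(rng.gauss(1.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]

xbar1 = sum(x1 for x1, _ in xs) / n
xbar2 = sum(x2 for _, x2 in xs) / n
R = hypot(xbar1, xbar2)                        # length of the observed average

# Test of beta = 0 within the semi-circle model (Example 7.5).
lam_within_curve = 2 * n * (R - xbar1)
# Test of the semi-circle model within the unrestricted model.
lam_curve = n * (1 - R) ** 2
# Test of theta = (1, 0) within the unrestricted model.
lam_full = n * ((xbar1 - 1) ** 2 + xbar2 ** 2)

print(f"{lam_curve:.4f} + {lam_within_curve:.4f} = {lam_curve + lam_within_curve:.4f}")
print(f"direct computation: {lam_full:.4f}")
```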


We shall also illustrate the testing of composite hypotheses in the case of constant 
coefficient of variation. 


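Example 7.7. [Continuation of Example 5.33] Assume that we in this example
observed $t_n = (.5, 3)^\top$ based on $n = 10$ observations and we wish to investigate
whether the model is correct, i.e. whether the coefficient of variation is actually
equal to 1, formally formulated as $H_0 : \sigma^2 = \xi^2$. Without this restriction we have
$\hat{\mu}_n = t_n$ so, ignoring the constant $2\pi$, the log-likelihood becomes


$$2\ell_n(\hat{\mu}_n) = -\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{\hat{\sigma}^2} - n \log \hat{\sigma}^2 = -n(1 + \log \hat{\sigma}^2).$$


For the specific example, we get

$$\hat{\sigma}^2 = (SS_n/n - S_n^2/n^2) = 3 - 1/4 = 2.75, \quad 2\ell_n(\hat{\mu}_n) = -20.12.$$

Further, the maximized log-likelihood assuming the model is correct is

$$2\ell_n(\hat{\hat{\mu}}_n) = 2\ell_n(\hat{\beta}_n) = -\frac{SS_n}{\hat{\beta}_n^2} + \frac{2S_n}{\hat{\beta}_n} - n - n \log \hat{\beta}_n^2 = -24.78,$$

where the term $-n$ arises from expanding $-\sum_{i=1}^n (x_i - \hat{\beta}_n)^2/\hat{\beta}_n^2$ and must be retained
to match the unrestricted expression above.
So we get $\Lambda_n = 24.78 - 20.12 = 4.66$ with an asymptotic p-value at $p = 0.031$
judged in a $\chi^2$-distribution with $2 - 1 = 1$ degrees of freedom. So the hypothesis
cannot reasonably be maintained.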


Alternatively we could consider the Wald statistic, and we shall first look at the
version where the covariance is estimated under the hypothesis. We get


$$\begin{aligned}
W_n &= n(\hat{\mu}_n - \hat{\hat{\mu}}_n)^\top \hat{\hat{\Sigma}}^{-1} (\hat{\mu}_n - \hat{\hat{\mu}}_n) \\
&= \frac{10}{81} \left(-1, -\tfrac{3}{2}\right) \begin{pmatrix} 108 & -24 \\ -24 & 8 \end{pmatrix} \begin{pmatrix} -1 \\ -\tfrac{3}{2} \end{pmatrix} \\
&= 10(108 - 36 - 36 + 18)/81 = 20/3 = 6.67,
\end{aligned}$$


corresponding to a p-value of $p = 0.010$ as we also here use a $\chi^2$-distribution with
$2 - 1 = 1$ degrees of freedom, leading to the same conclusion as above.

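If we estimate outside the hypothesis we get $\hat{\sigma}^2 = 11/4$, $\hat{\xi} = 1/2$ and thus using
expressions derived in Example 3.21, the estimated covariance becomes


$$\hat{\Sigma} = \begin{pmatrix} \hat{\sigma}^2 & 2\hat{\sigma}^2\hat{\xi} \\ 2\hat{\sigma}^2\hat{\xi} & 2\hat{\sigma}^4 + 4\hat{\sigma}^2\hat{\xi}^2 \end{pmatrix} = \frac{11}{8} \begin{pmatrix} 2 & 2 \\ 2 & 13 \end{pmatrix}$$


which has inverse

$$\hat{\Sigma}^{-1} = \frac{4}{121} \begin{pmatrix} 13 & -2 \\ -2 & 2 \end{pmatrix}$$

so

$$\tilde{W}_n = \frac{40}{121} \left(-1, -\tfrac{3}{2}\right) \begin{pmatrix} 13 & -2 \\ -2 & 2 \end{pmatrix} \begin{pmatrix} -1 \\ -\tfrac{3}{2} \end{pmatrix} = 20(26 - 12 + 9)/121 = 460/121 = 3.80.$$


This yields a p-value of $p = 0.051$ using a $\chi^2$-distribution with $2 - 1 = 1$ degrees
of freedom, so here we do not clearly reject the hypothesis. It is typically true that
the Wald statistic with covariance estimated outside the hypothesis loses power compared
to the other test statistics due to the extra variability in the estimate of the
covariance.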


7.7.2.3 Smooth hypotheses 


The theorem identifying the $\chi^2$-distribution of the log-likelihood ratio statistic can
be generalized to the slightly wider case of a smooth hypothesis of order $d$:


$$H_0 : h(\beta) = 0 \qquad (7.3)$$


where $h : B \to \mathbb{R}^{m-d}$ is a smooth map with Jacobian having full rank $m - d$
for all $\beta \in B$ with $h(\beta) = 0$. This fact is sometimes also referred to as Wilks’
theorem although Wilks (1938) actually gave the slightly weaker version in
Theorem 5.39.

The hypothesis $H_0$ may, for example, be of the form that certain parameters in
the larger family have the value 0:


$$H_0 : \beta_{d+1} = 0, \ldots, \beta_m = 0, \qquad (7.4)$$


in which case we have $\alpha = \Pi_d(\beta)$ where $\Pi_d$ is the coordinate projection onto the
first $d$ coordinates of $\beta \in B$. Then the map $\psi$ is simply


$$(\alpha_1, \ldots, \alpha_d) \mapsto (\alpha_1, \ldots, \alpha_d, 0, \ldots, 0)$$


in which case the smooth family is also a curved family. If the hypothesis has the
form (7.4) we say the hypothesis has standard form.

Theorem 7.8. Let $H_0 : h(\beta) = 0$ be a smooth hypothesis of order $d$ in a curved
exponential family $\mathcal{P} = \{P_{\phi(\beta)}, \beta \in B\}$ of dimension $m$. Then the maximized
log-likelihood ratio statistic $\Lambda_n$ for the hypothesis satisfies


$$\Lambda_n \xrightarrow{\mathcal{D}} \chi^2(m - d)$$


with respect to any $P \in \mathcal{P}_0 = \{P_{\phi(\beta)} \mid \beta \in B, h(\beta) = 0\}$. As before, $\chi^2(m - d)$
denotes the $\chi^2$-distribution with $m - d$ degrees of freedom.


Proof. Clearly, a curved composite hypothesis as in Theorem 5.39 is a special case of
a smooth hypothesis, since a curved hypothesis can always be represented in standard
form (7.4).

But, in fact, the implicit function theorem exactly says that the converse is true,
although only locally. If $\beta_0$ satisfies the hypothesis, i.e. if $h(\beta_0) = 0$, there is a
neighbourhood around $\phi(\beta_0)$ which has a smooth parametrization as a curved subfamily.
Since, asymptotically, only the local structure around $\phi(\beta_0)$ matters, Theorem 5.39
therefore essentially covers the smooth case as well. We refrain from giving the full
technical details.


Example 7.9. [Bivariate normal with mean on circle] A simple example of a smooth
hypothesis is a modification of Example 3.28. We again assume


$$X \sim N_2(\theta, I_2)$$

for $\Theta = \mathbb{R}^2$ but assume now further that the mean $\theta$ lies on the unit circle, i.e.

$$H_0 : \theta_1^2 + \theta_2^2 = 1.$$


Note that this differs from Example 3.28 where we only considered a half-circle.
This subfamily does not correspond to a curved exponential family since there is no
smooth homeomorphism from an open interval in $\mathbb{R}$ to the unit circle. However, as
we shall see, it does not really matter.

The Jacobian of the map $(x, y) \mapsto x^2 + y^2$ is $(2x, 2y)$ and hence this has full
rank everywhere except at $(x, y) = (0, 0)$. Going through the same minimization
exercise as in Example 5.27, we get that under $H_0$, the MLE is uniquely determined
if and only if $\bar{x} \neq 0$ and then


$$\hat{\theta} = \frac{1}{\sqrt{\bar{x}_1^2 + \bar{x}_2^2}} \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \end{pmatrix} = \bar{x}/R.$$

The log-likelihood ratio test statistic here is exactly the same as in Example 7.6:

$$\Lambda_n = n(1 - R)^2 \xrightarrow{\mathcal{D}} \chi^2(1);$$


we leave the details to the reader as an exercise. The difference is that the test
statistic $\Lambda_n$ now makes sense if only $(\bar{x}_1, \bar{x}_2) \neq (0, 0)$, since the MLE then always exists
and is well defined.
7.8 Exercises

Exercise 7.1. Show that the simple T-test in Section 7.6.2.2 is equivalent to the 
likelihood ratio test in a suitable linear normal model. 


Exercise 7.2. Consider comparing means as in Section 7.6.2.3 and show that the test 
described is equivalent to the likelihood ratio test when variances in the groups are 
assumed equal. Show that if the variances are not assumed equal, the projection onto 
the space of identical means depends on the ratio of variances. 


Exercise 7.3. The data below are based on Fisher (1947) and represent the weight 
of the heart in 16 cats as a percentage of the weight of the body. Do you see any 
difference between male and female cats? 


Gender Percentage of heart weight 
Female 0.276 0.247 0.288 0.274 0.200 0.276 0.242 
Male 0.275 0.221 0.280 0.197 0.283 0.260 0.226 0.289 0.206 


Exercise 7.4. Consider the situation with paired comparisons as in Section 7.6.2.4. 
Show that the suggested test based on differences is equivalent to the likelihood ratio 
test based on the original observations if and only if the variances are assumed equal. 


Exercise 7.5. Consider the Pareto distribution with fixed threshold $c$ ($c = 1$ in
Exercise 3.3) and index parameter $\theta > 0$:


$$f_\theta(x) = \frac{\theta c^\theta}{x^{\theta+1}} \quad \text{for } x > c.$$

The journal Forbes Magazine lists every year the world’s largest personal assets. 
The data below are from 2017 and 2018 and give the values of personal assets above
50 billion U.S. dollars, as listed by Forbes Magazine. 

Assets in billions of US dollars 
2017 86 76 73 71 56 54 52 
2018 112 90 84 72 71 70 67 60 58 


Assuming that these are Pareto distributed with threshold $c = 50$ and index parameter
$\theta_i > 0$, $i = 2017, 2018$, we would like to test the hypothesis that the index
parameter is unchanged from 2017 to 2018, i.e. $H_0 : \theta_{2017} = \theta_{2018}$.


a) Calculate the likelihood ratio test for this hypothesis, but calculate the p-value 
both via Monte Carlo methods and by asymptotic approximation. 


b) Under the assumption that Ho may be upheld, determine a 95% confidence inter- 
val for the index parameter 0. 


c) Would the hypothesis $H_1 : \theta_{2017} = \theta_{2018} = 1$ be compatible with the data from
Forbes Magazine?


Exercise 7.6. Two methods for analysing starch content in potatoes are to be ana- 
lysed and compared. Sixteen potatoes with varying content of starch are measured by 
the two methods with results displayed in Table 7.1. Are the measurement methods 
comparable? 
Table 7.1: Starch content in potatoes in % as measured in two different ways. Source:
Hald (1952) based on von Scheele et al. (1935).


Sample Method I Method II


1 21.7 21.5 
2 18.7 18.7 
3 18.3 18.3 
4 VS 17.4 
m) 18.5 18.3 
6 15.6 15.4 
7 17.0 16.7 
8 16.6 16.9 
9 14.0 13.9 
10 17.2 17.0 
11 21.7 21.4 
12 18.6 18.6 
13 17.9 18.0 
14 17.7 17.6 
15 18.3 18.5 
16 15.6 1559 


Exercise 7.7. Let $X$ and $Y$ be independent and exponentially distributed random
variables with $E(X) = \lambda_1$ and $E(Y) = \lambda_2$ where $\lambda = (\lambda_1, \lambda_2) \in \mathbb{R}_+^2$, and let
$(X_1, Y_1), \ldots, (X_n, Y_n)$ be a sample from this distribution. Consider the hypothesis
$H_0 : \lambda_1 \lambda_2 = 1$ as in Exercise 3.9, Exercise 4.10, and Exercise 5.10.

a) Determine the log-likelihood ratio statistic for the composite hypothesis $H_0$.


b) Determine a Wald test statistic for the composite hypothesis based on the fact that
the hypothesis may be expressed through the parameter function

$$\psi(\lambda) = \log \lambda_1 + \log \lambda_2$$

as $H_0 : \psi(\lambda) = 0$.


c) Does the following (simulated) sample support the hypothesis $H_0$? Use both the
likelihood ratio and Wald test statistics as derived above.


x   0.581 0.621 3.739 4.354 0.409 1.843 1.705 0.312
y   0.469 0.084 0.362 2.050 2.704 0.034 0.061 0.419


Now we shall investigate the simple hypothesis $H_1 : \lambda_1 = \lambda_2 = 1$ within the model
determined by $H_0$.


d) Derive the likelihood ratio statistic for $H_1$ under the assumption of $H_0$.

e) Derive the score test statistic for $H_1$ under the assumption of $H_0$.

f) Are the data under c) compatible with $H_1$?

g) Determine the Monte Carlo p-value for the above tests.


Exercise 7.8. Let $X_1, \ldots, X_n$ be a sample from a gamma distribution with unknown
shape $\alpha$ and scale.


a) Derive the likelihood ratio and Wald test statistics for the hypothesis $H_0 : \alpha = 1$,
i.e. the hypothesis that the observations are from an exponential distribution.

b) Calculate the asymptotic and Monte Carlo p-value for the hypothesis $H_0$ when
the following (simulated) data have been observed:

0.469 0.084 0.362 2.050 2.704 0.034 0.061 0.419


Exercise 7.9. Suppose a multiple choice exam is to be constructed, with four pos- 
sible answers in each category. If we make it a pass criterion that 62.5% of the ques- 
tions must be correctly answered, how many questions do we need to ensure that a 
student who is merely guessing is failing with probability 0.999? 


Exercise 7.10. A clinical trial is to be conducted with a new medicine for controlling
hypertension. A total of $2n$ patients who suffer from high blood pressure will be
randomized into two groups of equal size, treatment and control. Patients in the
treatment group are given the new medicine and patients in the control group are given
a traditional medicine. The blood pressure of the patients is measured (in mm Hg)
at the beginning of the trial, and three months later. The change in the average of
systolic and diastolic blood pressure is recorded for each patient. The standard
deviation $\sigma$ for differences of measurements of this type is known to be around 10 mm
Hg. How many patients are needed to detect a difference in effect of 10 mm Hg with
probability 0.99?
Chapter 8


Models for Tables of Counts 


In this chapter we shall further investigate a number of specific and much used mod- 
els for data given in the form of tables of counts. The models considered are all either 
curved or regular exponential models and the chapter therefore illustrates the use of 
the concepts developed in the previous chapters. 


8.1 Multinomial exponential families 
8.1.1 The unrestricted multinomial family


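We consider $n$ objects, classified into $k + 1$ groups which we here shall consider
labeled as $\mathcal{X} = \{0, 1, \ldots, k\}$. We may thus think of corresponding random variables
as $X_1, \ldots, X_n$ with values in $\mathcal{X}$. We shall consider these independent and identically
distributed with $\pi_x = P(X_i = x)$ denoting the probability that an object falls in the
category $x \in \mathcal{X}$.

We further assume that $\pi_x > 0$ for all $x \in \mathcal{X}$ so $\pi = (\pi_0, \ldots, \pi_k)$ is an element
of the open $k$-dimensional simplex $\Delta^k$:

$$\pi \in \Delta^k = \Big\{ \pi \in \mathbb{R}^{k+1} : \pi_x > 0,\ x \in \mathcal{X},\ \sum_{x=0}^k \pi_x = 1 \Big\}. \tag{8.1}$$

We note also that $\Delta^k$ is smoothly parametrized with $\pi_{\setminus 0} = (\pi_1, \ldots, \pi_k)$ so that we
may write

$$\Delta^k \cong \Delta^k_{\setminus 0} = \Big\{ \pi_{\setminus 0} \in \mathbb{R}^k : \pi_x > 0,\ x \in \mathcal{X}_{\setminus 0},\ \sum_{x=1}^k \pi_x < 1 \Big\},$$

where $\mathcal{X}_{\setminus 0} = \mathcal{X} \setminus \{0\}$. To identify this as an exponential family, we may for $\pi \in \Delta^k$
write the density of a random variable $X$ as

$$f_\pi(x) = \prod_{j=0}^k \pi_j^{1_{\{j\}}(x)} = \exp\Big( \sum_{j=1}^k \theta_j 1_{\{j\}}(x) - \psi(\theta) \Big) = \exp\big( \theta^\top t(x) - \psi(\theta) \big)$$

with $\theta_j = \log(\pi_j/\pi_0)$, $j = 1, \ldots, k$, $t(x) = (1_{\{1\}}(x), \ldots, 1_{\{k\}}(x))$, and

$$\psi(\theta) = \log\Big( 1 + \sum_{j=1}^k e^{\theta_j} \Big) = -\log \pi_0.$$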

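This identifies the family of multinomial distributions on $\mathcal{X}$ as a $k$-dimensional
regular exponential family with moment map

$$\tau_j(\theta) = E_\theta(t_j(X)) = P_\theta(X = j) = \pi_j = \frac{e^{\theta_j}}{1 + \sum_{l=1}^k e^{\theta_l}}, \quad j \in \mathcal{X}_{\setminus 0},$$

that is a diffeomorphism between $\Theta = \mathbb{R}^k$ and $\Delta^k_{\setminus 0}$ with inverse

$$\tau^{-1}(\pi_{\setminus 0})_j = \theta_j = \log \frac{\pi_j}{\pi_0} = \log \frac{\pi_j}{1 - \sum_{l=1}^k \pi_l}.$$

We thus note that if we let $Y_j = \sum_{i=1}^n t_j(X_i)$ denote the number of objects in
category $j$, we have the MLE determined by the corresponding moment equation
and hence

$$\hat\pi_j = Y_j/n, \quad j = 1, \ldots, k.$$

Since $\pi_0 = 1 - \sum_{j=1}^k \pi_j$ and $\sum_{j=0}^k Y_j = n$, we have $Y_0 = n - \sum_{j=1}^k Y_j$ and
therefore in fact also

$$\hat\pi_0 = 1 - \sum_{j=1}^k \hat\pi_j = Y_0/n.$$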


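The maximum-likelihood estimator is well-defined if all categories have been
observed, i.e. if $Y_j > 0$ for all $j = 0, \ldots, k$. The distribution of $Y = (Y_0, \ldots, Y_k)$ is
multinomial:

$$P_\pi(Y = y) = \binom{n}{y_0 \cdots y_k} \prod_{j=0}^k \pi_j^{y_j} = \frac{n!}{y_0! \cdots y_k!} \prod_{j=0}^k \pi_j^{y_j} = \exp\Big( \sum_{j=1}^k \theta_j y_j - n\psi(\theta) \Big) \frac{n!}{y_0! \cdots y_k!}.$$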


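The covariance becomes

$$V_\theta(t_j(X), t_l(X)) = \begin{cases} \pi_j(1 - \pi_j) & \text{if } j = l, \\ -\pi_j \pi_l & \text{if } j \neq l, \end{cases}$$

which can be obtained by differentiation of $\tau$ or directly, using that we in fact have
$t_j(X)^2 = t_j(X)$ since $t_j$ has values in $\{0, 1\}$. If we denote this covariance by $\Sigma_\pi$
we may write

$$V_\theta(t(X)) = \Sigma_\pi = D(\pi_{\setminus 0}) - \pi_{\setminus 0} \pi_{\setminus 0}^\top$$

where $D(\pi_{\setminus 0})$ is the $k \times k$ diagonal matrix

$$D(\pi_{\setminus 0}) = \begin{pmatrix} \pi_1 & 0 & \cdots & 0 \\ 0 & \pi_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \pi_k \end{pmatrix}. \tag{8.2}$$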


We then have 


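Lemma 8.1. The inverse covariance matrix for $t(X)$ is given as

$$\Sigma_\pi^{-1} = D(1/\pi_{\setminus 0}) + \frac{1}{\pi_0} 1_k 1_k^\top.$$

Proof. This follows from direct matrix multiplication. Using that

$$\pi_{\setminus 0}^\top 1_k = \sum_{j=1}^k \pi_j = 1 - \pi_0$$

we have

$$\begin{aligned}
\Sigma_\pi \Big( D(1/\pi_{\setminus 0}) + \frac{1}{\pi_0} 1_k 1_k^\top \Big)
&= D(\pi_{\setminus 0}) D(1/\pi_{\setminus 0}) - \pi_{\setminus 0} \pi_{\setminus 0}^\top D(1/\pi_{\setminus 0}) + \frac{1}{\pi_0} D(\pi_{\setminus 0}) 1_k 1_k^\top - \frac{1}{\pi_0} \pi_{\setminus 0} \pi_{\setminus 0}^\top 1_k 1_k^\top \\
&= I_k - \pi_{\setminus 0} 1_k^\top + \frac{1}{\pi_0} \pi_{\setminus 0} 1_k^\top - \frac{1 - \pi_0}{\pi_0} \pi_{\setminus 0} 1_k^\top = I_k,
\end{aligned}$$

as desired.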
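
As a numerical sanity check of Lemma 8.1, the following minimal sketch multiplies the covariance matrix by the claimed inverse for one probability vector. The vector `pi` is an arbitrary illustrative choice, not taken from the text, and plain Python lists are used instead of a matrix library.

```python
# Numerical check of Lemma 8.1 for one illustrative probability vector.
pi = [0.4, 0.3, 0.2, 0.1]      # hypothetical (pi_0, pi_1, ..., pi_k) with k = 3
pi0, rest = pi[0], pi[1:]
k = len(rest)

# Sigma = D(pi_rest) - pi_rest pi_rest^T, where pi_rest = (pi_1, ..., pi_k)
sigma = [[(rest[i] if i == j else 0.0) - rest[i] * rest[j] for j in range(k)]
         for i in range(k)]

# Claimed inverse: D(1 / pi_rest) + (1 / pi_0) * 1 1^T
inverse = [[(1.0 / rest[i] if i == j else 0.0) + 1.0 / pi0 for j in range(k)]
           for i in range(k)]

# The product should be the k x k identity matrix up to rounding error
product = [[sum(sigma[i][m] * inverse[m][j] for m in range(k)) for j in range(k)]
           for i in range(k)]
max_err = max(abs(product[i][j] - (1.0 if i == j else 0.0))
              for i in range(k) for j in range(k))
print(max_err)
```

The printed deviation from the identity matrix should be of the order of machine precision; such a check does not replace the proof, but it guards against sign errors when the lemma is used in computations.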


8.1.2 Curved multinomial families


In many examples we may know or postulate more about the structure of the vector
$\pi$ of probabilities, then typically specified as an $m$-dimensional curved subfamily of
the multinomial family, i.e. we have

$$\theta = \phi(\beta), \quad \beta \in B \subseteq \mathbb{R}^m,$$

where $\phi$ satisfies the conditions in Definition 3.24. It follows that the asymptotic
theory for maximum-likelihood estimation (MLE) and testing developed previously
applies, so that $\hat\beta_n$ is asymptotically well-defined and asymptotically distributed as
$N(\beta, i(\beta)^{-1}/n)$; we just have to solve the score equation (often numerically),
establish whether the solution corresponds to a global maximum of the likelihood
function, and calculate the Fisher information.
8.1.2.1 Score and information


We let

$$\pi_j(\beta) = \tau_j(\phi(\beta)) = E_{\phi(\beta)}(t_j(X)), \quad j \in \mathcal{X}_{\setminus 0}, \qquad \pi_0(\beta) = P_\beta(X = 0)$$

and get for the log-likelihood and score functions for a single observation

$$\ell(x, \beta) = \log \pi_x(\beta), \qquad S(x, \beta) = \frac{D\pi_x(\beta)}{\pi_x(\beta)} \tag{8.3}$$

and further for the Fisher information, using the second Bartlett identity in
Theorem 1.23:

$$i(\beta) = E_\beta\big( S(X, \beta)^\top S(X, \beta) \big) = \sum_{j=0}^k \frac{D\pi_j(\beta)^\top D\pi_j(\beta)}{\pi_j(\beta)}. \tag{8.4}$$

For the case of $n$ observations, we then get from (8.3) that the log-likelihood and
score functions are

$$\ell_n(\beta) = \sum_{j=0}^k Y_j \log \pi_j(\beta), \qquad S_n(\beta) = \sum_{j=0}^k Y_j \frac{D\pi_j(\beta)}{\pi_j(\beta)}. \tag{8.5}$$

8.1.2.2 Likelihood ratio 


If we let $\mathcal{P}$ denote the set of distributions in the unrestricted multinomial family
and $\mathcal{P}_0$ the set of distributions specified by the curved multinomial subfamily, the
likelihood ratio test statistic for the composite hypothesis $H_0 : P \in \mathcal{P}_0$ becomes

$$\Lambda_n = 2\big( \ell_n(\hat\pi) - \ell_n(\hat\beta) \big) = 2 \sum_{j=0}^k Y_j \log \frac{\hat\pi_j}{\pi_j(\hat\beta)} = 2 \sum_{j=0}^k Y_j \log \frac{Y_j}{n \pi_j(\hat\beta)} = 2 \sum_{j=0}^k \mathrm{OBS}_j \log \frac{\mathrm{OBS}_j}{\mathrm{EXP}_j} \tag{8.6}$$

where $\mathrm{OBS}_j$ is the observed number of objects in group $j$, $\mathrm{EXP}_j = n\pi_j(\hat\beta)$ is the
expected number of objects in category $j$, and the log likelihood ratio $\Lambda_n$ is
asymptotically distributed as $\chi^2(k - m)$.

Note that, strictly speaking, this test statistic is not well-defined unless it holds
that $\mathrm{OBS}_j = Y_j > 0$ for all $j$. However, if we use the convention that $0 \cdot \log 0 = 0$,
the statistic makes sense just if $\mathrm{EXP}_j \neq 0$.


8.1.2.3 Wald statistics 


The Wald statistic for the same hypothesis would be given as

$$W_n = n \big( \hat\pi_{\setminus 0} - \pi(\hat\beta)_{\setminus 0} \big)^\top \Sigma^{-1}_{\pi(\hat\beta)} \big( \hat\pi_{\setminus 0} - \pi(\hat\beta)_{\setminus 0} \big).$$

Using Lemma 8.1 yields a simpler expression for $W_n$:

$$\begin{aligned}
W_n &= n \big( \hat\pi_{\setminus 0} - \pi(\hat\beta)_{\setminus 0} \big)^\top D\big( 1/\pi(\hat\beta)_{\setminus 0} \big) \big( \hat\pi_{\setminus 0} - \pi(\hat\beta)_{\setminus 0} \big) + n \frac{\big( 1_k^\top ( \hat\pi_{\setminus 0} - \pi(\hat\beta)_{\setminus 0} ) \big)^2}{\pi_0(\hat\beta)} \\
&= n \sum_{j=1}^k \frac{(Y_j/n - \pi_j(\hat\beta))^2}{\pi_j(\hat\beta)} + n \frac{(\hat\pi_0 - \pi_0(\hat\beta))^2}{\pi_0(\hat\beta)} \\
&= \sum_{j=0}^k \frac{(Y_j - n\pi_j(\hat\beta))^2}{n\pi_j(\hat\beta)} = \sum_{j=0}^k \frac{(\mathrm{OBS}_j - \mathrm{EXP}_j)^2}{\mathrm{EXP}_j}.
\end{aligned} \tag{8.7}$$

The Wald statistic in (8.7) is also known as Pearson's $\chi^2$ and has an asymptotic
$\chi^2$-distribution with $k - m$ degrees of freedom as was also true for the likelihood ratio
$\Lambda_n$ in (8.6). In fact, since this Wald statistic is based on the mean value parameter,
it is also equal to the quadratic score statistic and the expression in (8.7) is therefore
invariant under reparametrization. This may also be seen directly as the last
expression is free from terms that depend on the specific parametrization of the two models
involved.

In the statistical literature on the analysis of tables of counts, see for example
Agresti (2002), it is common to use the notation $G^2$ for $\Lambda$ and $X^2$ for $W$ which
we shall also do here. Also, we shall not in any detail discuss issues relating to the
quality of the $\chi^2$-approximation to the distribution of the test statistics, but just
mention that experience has shown that the approximation is adequate whenever
$\mathrm{EXP}_j > 5$ for all $j$. If this is not the case, Monte Carlo methods may be used.
Sometimes the condition $\mathrm{EXP}_j > 5$ may be achieved by joining categories with small
expected numbers, see Example 8.2.
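
The two statistics in (8.6) and (8.7) are simple to compute once observed and expected counts are available. The following is a minimal sketch; the function names `g2_statistic` and `x2_statistic` and the counts are hypothetical and chosen only for illustration, with one empty category to show the convention $0 \cdot \log 0 = 0$.

```python
import math

def g2_statistic(obs, exp):
    """Likelihood ratio statistic 2 * sum OBS * log(OBS / EXP), using 0 * log 0 = 0."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)

def x2_statistic(obs, exp):
    """Pearson statistic sum (OBS - EXP)^2 / EXP."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Hypothetical counts in four categories and a fully specified model (m = 0)
obs = [18, 0, 30, 12]
probs = [0.25, 0.05, 0.50, 0.20]
exp = [sum(obs) * p for p in probs]

g2 = g2_statistic(obs, exp)
x2 = x2_statistic(obs, exp)
print(g2, x2)
```

Note that the empty category contributes nothing to $G^2$ but does contribute $\mathrm{EXP}_j$ to $X^2$, so the two statistics can differ noticeably when expected counts are small.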

For the sake of completeness, we display the alternative Wald statistic where the
variance is estimated in the larger model:

$$\tilde W_n = n \big( \hat\pi_{\setminus 0} - \pi(\hat\beta)_{\setminus 0} \big)^\top \Sigma^{-1}_{\hat\pi} \big( \hat\pi_{\setminus 0} - \pi(\hat\beta)_{\setminus 0} \big) = \sum_{j=0}^k \frac{(Y_j - n\pi_j(\hat\beta))^2}{Y_j} = \sum_{j=0}^k \frac{(\mathrm{OBS}_j - \mathrm{EXP}_j)^2}{\mathrm{OBS}_j}.$$

This differs from Pearson's $\chi^2$ by using observed rather than expected counts in
the denominator. However, this is rarely used; the computational effort is mostly
associated with calculating $\mathrm{EXP}_j$, and thus there is no essential computational saving
compared to calculating $G^2$ or $X^2$.


8.1.3 Residuals 


To check the validity of a model used, it is useful to supplement the numerical test
statistics in (8.6) and (8.7) by inspecting appropriate residuals representing the
difference between the observed and expected values, properly normalized. Inspection
may be done graphically, to disclose systematic unexpected patterns, or numerically,
by identifying residuals that are particularly large, possibly representing deviations
from the model used that may then be further investigated. The Pearson residuals are
the quantities

$$R_j = \frac{\mathrm{OBS}_j - \mathrm{EXP}_j}{\sqrt{\mathrm{EXP}_j}}$$

and Pearson's $\chi^2$ statistic is then equal to $W = \sum_j R_j^2$.

The Pearson residuals are asymptotically normally distributed with expectation
0, but their asymptotic variance is smaller than 1 and equal to $1 - h_{jj}$, where $h_{jj}$ are
the diagonal elements of the hat-matrix of the model as given in (5.21), i.e. the matrix
for the orthogonal projection of $y$ onto the space of tangents to the model with respect
to the inner product determined by the Fisher information, see Lemma 5.32 for further
details. It may therefore be more useful to consider the standardized Pearson
residuals that correct for this reduced variance, i.e.

$$R_j^* = \frac{\mathrm{OBS}_j - \mathrm{EXP}_j}{\sqrt{\mathrm{EXP}_j (1 - h_{jj})}}.$$

8.1.4. Weldon’s dice 


We illustrate the previous developments in a simple example, described in the classic 
text of Fisher (1934). 


Example 8.2. [Weldon's dice data] Walter Frank Raphael Weldon (1860-1906) was
a Professor of Zoology at Oxford University and worked on statistics with Francis
Galton and Karl Pearson. Weldon rolled a set of 12 dice 26306 times (!) and recorded
how many of the 12 dice were facing 5 or 6. The resulting data are displayed in
Table 8.1. The total number $Y$ with faces 5 or 6 when rolling 12 dice should follow a
binomial distribution with length 12 and success parameter $\theta$, yielding the
probabilities

$$\pi_j(\theta) = P_\theta(Y = j) = \binom{12}{j} \theta^j (1 - \theta)^{12 - j}.$$

If the dice were fair, we would have $\theta = 1/3$ and the expected numbers would be

$$\mathrm{EXP}_j = 26306 \times \binom{12}{j} \Big( \frac{1}{3} \Big)^j \Big( \frac{2}{3} \Big)^{12 - j}$$

and these numbers are displayed in the third column of Table 8.1. It is apparent that
the observed counts are too small for $j = 0, \ldots, 4$ and too large for $j > 4$, also
reflected in the Pearson residuals, displayed in the fifth column.

Pearson's $\chi^2$ for goodness-of-fit is equal to the sum of squares of these residuals
and evaluates to $X^2 = 35.49$ when categories 10, 11, and 12 are combined to achieve
reasonably large expected numbers. Comparing to a $\chi^2$-distribution with 10 degrees
of freedom (there are 11 categories when the last three are combined into one)
yields a p-value of 0.0001 and we must reject the hypothesis that the dice are fair.
The likelihood ratio statistic $G^2 = 35.10$ is not much different.

Table 8.1. Weldon's dice data: the number of dice among 12 which face 5 or 6 together
with their expected number under the assumption that the dice are fair or biased and the
associated Pearson residuals. Residuals of absolute value 1.96 or higher are marked with
an asterisk. Categories 10, 11, and 12 are combined for the calculation of residuals.
Source: Fisher (1934).


# 5 or 6   OBS     EXP fair   EXP biased   Residual fair   Residual biased
0          185     202.75     187.38       -1.25           -0.17
1          1149    1216.50    1146.51      -1.94            0.07
2          3265    3345.37    3215.24      -1.39            0.88
3          5475    5575.61    5464.70      -1.35            0.14
4          6114    6272.56    6269.35      -2.00*          -1.96*
5          5194    5018.05    5114.65       2.48*           1.11
6          3067    2927.20    3042.54       2.58*           0.44
7          1331    1254.51    1329.73       2.16*           0.03
8          403     392.04     423.76        0.55           -1.01
9          105     87.12      96.03         1.92            0.92
10         14      13.07      14.69
11         4       1.19       1.36          0.97            0.47
12         0       0.05       0.06

Total      26306   26306      26306        $X^2 = 35.49$   $X^2 = 8.18$

Alternatively, we could try to stick with the binomial distribution, but now estimate
the success probability which here may be calculated to

$$\hat\theta = 106602/315672 = 0.3377$$

which is slightly larger than 1/3. Expected values calculated for this value of $\theta$ are
displayed in the fourth column of Table 8.1 and yield a much better description of
the observed values, as reflected both in the Pearson residuals in the last column and
in Pearson's $\chi^2$ which now evaluates to $X^2 = 8.18$ yielding a p-value of 0.52 when
compared to a $\chi^2$-distribution with $10 - 1 = 9$ degrees of freedom, thus indicating
that the data conform well with the binomial distribution. Here the likelihood ratio
statistic is indistinguishably equal to $G^2 = 8.18$.
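
The calculations for Weldon's data can be retraced with a short script. This is a sketch using only the Python standard library; the helper names `expected` and `pearson_x2` are hypothetical, and the tail probability for 10 degrees of freedom uses the closed form of the $\chi^2$ survival function that is available for an even number of degrees of freedom.

```python
import math

# Observed numbers of rolls with j = 0, ..., 12 dice showing 5 or 6 (Table 8.1)
obs = [185, 1149, 3265, 5475, 6114, 5194, 3067, 1331, 403, 105, 14, 4, 0]
n_rolls, n_dice = sum(obs), 12

def expected(theta):
    """Expected counts n * P(Y = j) under binom(12, theta), categories 10-12 pooled."""
    e = [n_rolls * math.comb(n_dice, j) * theta ** j * (1 - theta) ** (n_dice - j)
         for j in range(n_dice + 1)]
    return e[:10] + [sum(e[10:])]

obs_pooled = obs[:10] + [sum(obs[10:])]

def pearson_x2(exp):
    """Pearson's statistic: sum of squared Pearson residuals."""
    return sum((o - e) ** 2 / e for o, e in zip(obs_pooled, exp))

# MLE of the success probability: number of successes among 12 * 26306 dice
theta_hat = sum(j * o for j, o in enumerate(obs)) / (n_dice * n_rolls)

x2_fair = pearson_x2(expected(1 / 3))
x2_fitted = pearson_x2(expected(theta_hat))

# Upper tail of chi-square(10): exp(-x/2) * sum_{i<5} (x/2)^i / i!
half = x2_fair / 2
p_fair = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(5))
print(round(theta_hat, 4), round(x2_fair, 2), round(x2_fitted, 2), p_fair)
```

The pooling of categories 10, 11, and 12 is done inside `expected` and for `obs_pooled` so that both fits are compared on the same 11 categories.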
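The 95% Wald confidence interval for $\theta$, based on the $12 \times 26306 = 315672$ individual
dice, evaluates to

$$\hat\theta \pm 1.96 \sqrt{\hat\theta (1 - \hat\theta)/315672} = (0.3360, 0.3393)$$

which does not include the value 1/3, so the overall frequency of faces 5 or 6 alone
already indicates that the dice are not fair. The good fit of the binomial distribution
with estimated $\theta$ shows that there is no sign of further systematic deviation beyond
this shift in the success probability.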
The data have been analyzed by several authors over the years and different
explanations have been suggested for the phenomenon, including the fact that some
7000 throws or so were actually made by an assistant. Since the opposite faces of
4, 5, and 6 are 3, 2, and 1, and the eyes of a die typically have been carved out, the
carvings might just affect the center of gravity of the die to favour the larger numbers
over the smaller. However, as far as I am aware, nobody has yet gone to the
trouble of repeating Weldon's experiment.

We may also here emphasize that although the deviation from the binomial model
with success parameter 1/3 is significant, it is so small that it for most practical uses
is unimportant, see also Example 7.1 for a similar situation.


8.2 Genetic equilibrium models 


Many models for multinomial observations originate from genetics, not least because 
R. A. Fisher was a geneticist and developed much statistical methodology with the 
purpose of solving scientific problems within this field. In this section we briefly 
describe two classical genetic models. It is remarkable that these models were developed
long before DNA was discovered and have survived the molecular revolution in
genetics, just with a more precise understanding of what a gene is.


8.2.1 Hardy-Weinberg equilibrium 


We consider a so-called trait that is determined by a single gene which might come 
in exactly two varieties, say a and A. These varieties are termed alleles. Humans are 
diploid individuals, meaning that chromosomes come in pairs, one inherited from 
the father, and one from the mother, so at a specific locus (position on the genome), 
the genetic composition might be any of aa, aA, and AA and this composition is 
referred to as the genotype. 

The study of so-called SNPs (Single Nucleotide Polymorphisms) is an import- 
ant part of modern genetics; then the allele A would represent a possible mutation 
on a specific location on the genome, whereas the allele a would represent the most 
common variant. For example, a could correspond to the typical and most common 
nucleobase pair in the population at that locus, say C-G, and A might correspond to 
a mutation to the other possible nucleobase pair A-T. Here A, C, G, and T are abbre- 
viations of the names of the nucleobases Adenine, Cytosine, Guanine, and Thymine 
which are the main building blocks for the DNA strings in a chromosome. 

It is technically difficult (although not impossible) to identify which of two alleles
on a locus originate from the father and which from the mother, so only the genotype
of an individual is easily observable.

A simple genetic model for the frequency of these genotypes is known as Hardy-Weinberg
equilibrium. It can be shown that if members of a population mate at random
and there are no selection or other adverse effects, then the distribution of the
genotypes of a single individual will be as if the alleles a and A are allocated
completely at random, so if we let $\beta \in B = (0, 1)$ denote the relative frequency of the
allele A in the population, the probabilities of the genotype $X$ of a random individual
would be

$$P\{X = aa\} = (1 - \beta)^2, \quad P\{X = Aa\} = 2\beta(1 - \beta), \quad P\{X = AA\} = \beta^2$$

or, in other words, the number of alleles of type A in an individual follows a binomial
distribution of length 2 and probability of success $\beta$. Thus the genotype of an
individual follows a multinomial distribution with parameters

$$\pi_0(\beta) = (1 - \beta)^2, \quad \pi_1(\beta) = 2\beta(1 - \beta), \quad \pi_2(\beta) = \beta^2.$$

This function is a smooth and injective homeomorphism and has Jacobian

$$D\pi(\beta)_{\setminus 0} = (2 - 4\beta, 2\beta)^\top$$

which has full rank for $\beta \in (0, 1)$; so the above specifies a curved multinomial model
of dimension 1.

Observing genotypes $(Y_0 = y_0, Y_1 = y_1, Y_2 = y_2)$ of $n$ individuals, the
log-likelihood function becomes

$$\ell(\beta) = (2y_0 + y_1) \log(1 - \beta) + (y_1 + 2y_2) \log \beta + y_1 \log 2 \tag{8.8}$$

yielding the score equation

$$-\frac{2y_0 + y_1}{1 - \beta} + \frac{y_1 + 2y_2}{\beta} = 0$$

with the unique solution

$$\hat\beta = \frac{y_1 + 2y_2}{2(y_0 + y_1 + y_2)} = \frac{y_1 + 2y_2}{2n}$$

which is a valid solution if and only if $0 < y_1 + 2y_2 < 2n$. Since this is the only
stationary point, and the log-likelihood function satisfies

$$\lim_{\beta \to 0} \ell(\beta) = \lim_{\beta \to 1} \ell(\beta) = -\infty,$$

the solution to the score equation must be a global maximum, and hence, we have
identified the MLE. The MLE is asymptotically well-defined since

$$\lim_{n \to \infty} P_\beta\{0 < Y_1 + 2Y_2 < 2n\} = 1.$$

The Fisher information becomes

$$i_n(\beta) = \frac{E_\beta(2Y_0 + Y_1)}{(1 - \beta)^2} + \frac{E_\beta(Y_1 + 2Y_2)}{\beta^2} = \frac{2n}{1 - \beta} + \frac{2n}{\beta} = \frac{2n}{\beta(1 - \beta)}$$

so $\hat\beta$ is asymptotically distributed as $N(\beta, \beta(1 - \beta)/(2n))$. This could also have been
derived by realizing that $Y_1 + 2Y_2$ is the number of alleles of type A among $2n$ alleles,
with $\hat\beta$ being the frequency of allele A.
If we now wish to test the hypothesis that a population is in Hardy-Weinberg
equilibrium based on the observations, we just note that $\mathrm{OBS}_j = y_j$ and the expected
numbers are

$$\mathrm{EXP}_0 = \frac{(2y_0 + y_1)^2}{4n}, \quad \mathrm{EXP}_1 = \frac{(2y_0 + y_1)(y_1 + 2y_2)}{2n}, \quad \mathrm{EXP}_2 = \frac{(y_1 + 2y_2)^2}{4n}.$$

These can now be used either to calculate the likelihood ratio statistic $G^2 = \Lambda$ using
(8.6) or the Pearson $\chi^2$ statistic $X^2$ using (8.7).

Example 8.3. [Cod in the Baltic Sea] We shall illustrate this with data from Sick
(1965) where 86 cod were caught near the island of Bornholm and genotyped with
respect to their haemoglobin type resulting in $y = (14, 20, 52)$. We get

$$\hat\beta = (20 + 2 \times 52)/(2 \times 86) = 124/172 = 0.721$$

and the corresponding expected numbers become $\mathrm{EXP} = (6.7, 34.6, 44.7)$ leading
to $\Lambda = G^2 = 14.45$ and $X^2 = W = 15.31$. These numbers should be judged in an
asymptotic $\chi^2(1)$ distribution, both leading to $p \approx 0.0001$ so we conclude that the
cod population at this location was not in Hardy-Weinberg equilibrium.
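
The Hardy-Weinberg test is fully explicit, so it can be sketched in a few lines. The function name `hardy_weinberg_test` is hypothetical; the formulas are the closed-form MLE and the statistics (8.6) and (8.7), and the tail probability of the $\chi^2(1)$ distribution is obtained from the complementary error function.

```python
import math

def hardy_weinberg_test(y0, y1, y2):
    """Return (beta_hat, expected counts, G2, X2) for genotype counts (aa, aA, AA)."""
    n = y0 + y1 + y2
    beta = (y1 + 2 * y2) / (2 * n)               # MLE of the frequency of allele A
    exp = [n * (1 - beta) ** 2, 2 * n * beta * (1 - beta), n * beta ** 2]
    obs = [y0, y1, y2]
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
    x2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return beta, exp, g2, x2

# Cod data from Example 8.3
beta_hat, exp, g2, x2 = hardy_weinberg_test(14, 20, 52)

# chi-square(1) upper tail: P(Z^2 > x) = erfc(sqrt(x / 2))
p_g2 = math.erfc(math.sqrt(g2 / 2))
p_x2 = math.erfc(math.sqrt(x2 / 2))
print(round(beta_hat, 3), [round(e, 1) for e in exp], round(g2, 2), round(x2, 2))
```

The function assumes $0 < y_1 + 2y_2 < 2n$, which is the condition for the MLE to be well-defined; otherwise some expected counts vanish and the statistics are not defined.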


8.2.2. The ABO blood type system 


A slightly more complicated genetic model is associated with the ABO blood type
system. This trait is determined by a single gene on the 9th chromosome that has
three types of allele: A, B or O. Here the A and B alleles are co-dominant and both
dominate O, so only persons with genotype OO have blood type O, whereas persons
with genotypes OA or AA have blood type A, OB or BB have blood type B, and
genotype AB yields blood type AB.

If the population is in Hardy-Weinberg equilibrium, genes can be assumed allocated
at random to individuals. Thus if we let $p$ denote the frequency of allele A and $q$
the frequency of allele B, we get the following probabilities for the four observable
blood types

$$\pi_0 = (1 - p - q)^2, \quad \pi_1 = p^2 + 2p(1 - p - q), \quad \pi_2 = q^2 + 2q(1 - p - q), \quad \pi_3 = 2pq.$$

The Jacobian becomes

$$D\pi_{\setminus 0}(p, q) = \begin{pmatrix} 2(1 - p - q) & -2p \\ -2q & 2(1 - p - q) \\ 2q & 2p \end{pmatrix}$$

which has full rank 2 for $p, q > 0$ and $p + q < 1$ since if we have $\lambda_1 q + \lambda_2 p = 0$ we
get $\lambda_2 = -\lambda_1 q/p$ and thus if

$$\lambda_1 (1 - p - q) - \lambda_2 p = \lambda_1 (1 - p) = 0$$

we must have $\lambda_1 = 0$ and therefore also $\lambda_2 = 0$. Here the maximum-likelihood
estimator must be determined numerically.
Example 8.4. [Blood types of Danish individuals] Data on blood types of 1266
Danish individuals given in Hansen (2012, page 234) yield the counts

$$y = (535, 547, 140, 44),$$

leading to maximum-likelihood estimates $(\hat p, \hat q) = (0.270, 0.076)$ and thus expected
values

$$\mathrm{EXP} = (541.5, 539.4, 133.2, 52.0).$$

For the test statistics, we get $G^2 = 1.82$ and $X^2 = 1.76$ with corresponding p-values
0.178 and 0.185 when compared to a $\chi^2(3 - 2) = \chi^2(1)$ distribution. Thus there is
no reason to doubt the equilibrium model.
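
Since the score equation for the ABO model has no closed-form solution, some iterative method is needed. The text does not say which method was used; the sketch below uses the classical gene-counting iteration (an instance of the EM algorithm), which is one standard choice and not necessarily the one behind the numbers above. The function names `abo_probs` and `abo_mle`, the starting point, and the fixed number of iterations are assumptions made for illustration.

```python
def abo_probs(p, q):
    """Blood type probabilities (O, A, B, AB) under Hardy-Weinberg equilibrium."""
    r = 1 - p - q
    return [r * r, p * p + 2 * p * r, q * q + 2 * q * r, 2 * p * q]

def abo_mle(y, iterations=200):
    """Gene-counting (EM) iteration for allele frequencies from counts (O, A, B, AB)."""
    n_o, n_a, n_b, n_ab = y
    n = sum(y)
    p = q = 1 / 3                                # arbitrary interior starting point
    for _ in range(iterations):
        r = 1 - p - q
        # E-step: expected numbers of A and B alleles among persons of type A and B
        a_alleles = n_a * (2 * p * p + 2 * p * r) / (p * p + 2 * p * r)
        b_alleles = n_b * (2 * q * q + 2 * q * r) / (q * q + 2 * q * r)
        # M-step: relative allele frequencies among the 2n alleles
        p = (a_alleles + n_ab) / (2 * n)
        q = (b_alleles + n_ab) / (2 * n)
    return p, q

# Danish blood type data from Example 8.4
y = (535, 547, 140, 44)
p_hat, q_hat = abo_mle(y)
exp = [sum(y) * prob for prob in abo_probs(p_hat, q_hat)]
print(round(p_hat, 3), round(q_hat, 3), [round(e, 1) for e in exp])
```

Each iteration increases the likelihood and keeps $(p, q)$ inside the parameter space, which makes this scheme robust to the choice of starting point; Newton or Fisher scoring iterations based on (8.4) and (8.5) would be an alternative.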


8.3. Contingency tables 


In this section we shall discuss a variety of models for cross-classified data also 
known as contingency tables. 


8.3.1 Comparing multinomial distributions 

Here we consider related multinomial distributions and we wish to investigate 
whether they are identical. 

8.3.1.1 Comparing two multinomial distributions 


We first treat the case where we have mutually independent random variables
$X_{01}, \ldots, X_{0n_0}$ and $X_{11}, \ldots, X_{1n_1}$ with values in $\mathcal{X} = \{0, 1, \ldots, k\}$ and distributions

$$\pi_0 = (\pi_{0x}, x \in \mathcal{X}), \quad \pi_1 = (\pi_{1x}, x \in \mathcal{X})$$

where $\pi_0, \pi_1 \in \Delta^k$, the $k$-dimensional simplex as defined in (8.1). Each of the two
families associated with $\pi_0$ and $\pi_1$ is a regular exponential family, as established
in Section 8.1, so the joint family is an example of an outer product as described in
Section 3.4. The canonical parameter space is

$$\Theta = \Theta_0 \times \Theta_1 = \mathbb{R}^k \times \mathbb{R}^k,$$

the canonical parameter is

$$\theta = (\theta_0, \theta_1), \quad \theta_{ij} = \log \frac{\pi_{ij}}{\pi_{i0}}, \quad i = 0, 1;\ j = 1, \ldots, k$$

and the canonical statistic $Y$ is

$$Y_{ij} = \sum_{m=1}^{n_i} t_j(X_{im}) = \sum_{m=1}^{n_i} 1_{\{j\}}(X_{im}) = \mathrm{OBS}_{ij}, \quad i = 0, 1;\ j = 1, \ldots, k.$$
As before, the MLE of the unknown probabilities are the relative frequencies

$$\hat\pi_{ij} = \frac{Y_{ij}}{n_i} = \frac{\mathrm{OBS}_{ij}}{n_i}, \quad i = 0, 1;\ j = 0, \ldots, k$$

where

$$Y_{ij} = \sum_{m=1}^{n_i} 1_{\{j\}}(X_{im}) = \mathrm{OBS}_{ij}, \quad i = 0, 1;\ j = 0, \ldots, k$$

is the total number of observations in category $j$ for group $i$. The MLE is well-defined
if and only if $Y_{ij} > 0$ for all $i, j$.

We are interested in the hypothesis of homogeneity, i.e. that the distribution over
categories is the same for the two groups, $H_0 : \pi_0 = \pi_1$ or, equivalently, $H_0 : \theta_0 = \theta_1$,
corresponding to the direct product of the multinomial families. This is a regular
exponential family with canonical parameter space $\Xi = \mathbb{R}^k$ where $\xi_j$ is the common
value of the canonical parameters in the larger model

$$\xi_j = \theta_{0j} = \theta_{1j} = \log \frac{\pi_{ij}}{\pi_{i0}}, \quad i = 0, 1;\ j = 1, \ldots, k.$$

We may think of the smaller model as classifying $n = n_0 + n_1$ objects in the
categories given by $\mathcal{X}$ and thus if we let $Y_{+j} = Y_{0j} + Y_{1j}$ be the total number of objects
in category $j$, the MLE $\hat{\hat\pi}$ of the joint probability distribution under the hypothesis is

$$\hat{\hat\pi}_{ij} = \frac{Y_{+j}}{n}, \quad i = 0, 1;\ j = 0, \ldots, k.$$


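Using that log-likelihood functions for product models are additive as in (3.7) and
(3.8), we now obtain the two maximized log-likelihood functions

$$\ell(\hat\pi) = \sum_{j=0}^k Y_{0j} \log \frac{Y_{0j}}{n_0} + \sum_{j=0}^k Y_{1j} \log \frac{Y_{1j}}{n_1}$$

and

$$\ell(\hat{\hat\pi}) = \sum_{j=0}^k Y_{0j} \log \frac{Y_{+j}}{n} + \sum_{j=0}^k Y_{1j} \log \frac{Y_{+j}}{n} = \sum_{j=0}^k Y_{+j} \log \frac{Y_{+j}}{n}$$

leading to the likelihood ratio statistic

$$\begin{aligned}
G^2 &= 2\big( \ell(\hat\pi) - \ell(\hat{\hat\pi}) \big) = 2 \Big( \sum_{i=0}^1 \sum_{j=0}^k Y_{ij} \log \frac{Y_{ij}}{n_i} - \sum_{i=0}^1 \sum_{j=0}^k Y_{ij} \log \frac{Y_{+j}}{n} \Big) \\
&= 2 \sum_{i=0}^1 \sum_{j=0}^k Y_{ij} \log \frac{Y_{ij}\, n}{n_i Y_{+j}} = 2 \sum_{i=0}^1 \sum_{j=0}^k \mathrm{OBS}_{ij} \log \frac{\mathrm{OBS}_{ij}}{\mathrm{EXP}_{ij}}
\end{aligned}$$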
Table 8.2. Weight of 300 trout caught in 1958 and 1960 in a Danish lake. Data from
examination at Aarhus University June 1963.


        Weight of trout in grams
Year    0-99   100-199   200-299   300-399   400+   Total
1958    12     36        74        19        9      150
1960    16     43        70        15        6      150
Total   28     79        144       34        15     300

where we have exploited that the expected number of objects in category $j$ for group
$i$ under the hypothesis is

$$\mathrm{EXP}_{ij} = \frac{n_i Y_{+j}}{n}.$$

The log-likelihood ratio statistic has an asymptotic $\chi^2(2k - k) = \chi^2(k)$ distribution
under $H_0$ for $n \to \infty$ by Theorem 5.39; as does the corresponding Wald statistic
(Corollary 5.38), leading also here to Pearson's $\chi^2$ statistic

$$X^2 = W_n = \sum_{i=0}^1 \sum_{j=0}^k \frac{(\mathrm{OBS}_{ij} - \mathrm{EXP}_{ij})^2}{\mathrm{EXP}_{ij}}.$$

This follows from (8.7) and the fact that the independence of the two samples implies
$W_n = W_{0n} + W_{1n}$.

Example 8.5. [Weight distribution of trout] A factory was built in 1959 at a lake 
in Denmark, and its waste water after cleaning was released into the lake. To in- 
vestigate the effect of the waste water on the lake environment, the weight of 150 
trout caught in 1958 and 150 trout caught in 1960 were recorded. The results are 
displayed in Table 8.2. The likelihood ratio and Pearson test statistics are 2.38 and
2.37, respectively, and the associated asymptotic p-value is 0.67 when compared to
a $\chi^2$-distribution with 4 degrees of freedom, so there is no significant evidence of a
change in the weight distribution. We note, however, that the distribution does seem 
to have shifted towards lower values but this type of systematic trend is not picked 
up by our analysis. We refer to the fact that the weight categories are ordered by 
saying that the weight is an ordinal variable. This suggests the use of alternative test 
statistics, see further in Section 8.3.6. 
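
The homogeneity test for the trout data can be retraced as follows. This is a sketch with a hypothetical function name `homogeneity_test`; it accepts a table with any number of rows and returns $(r - 1)(s - 1)$ degrees of freedom for $r$ rows and $s$ columns, which reduces to $k$ in the case of two groups treated above. The p-value line uses the closed form of the $\chi^2$ tail probability for 4 degrees of freedom only.

```python
import math

def homogeneity_test(table):
    """G2, X2 and degrees of freedom for homogeneity of the rows of a table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    g2 = x2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n        # EXP_ij = n_i * Y_+j / n
            x2 += (obs - exp) ** 2 / exp
            if obs > 0:                              # convention 0 * log 0 = 0
                g2 += 2 * obs * math.log(obs / exp)
    df = (len(table) - 1) * (len(col_tot) - 1)
    return g2, x2, df

trout = [[12, 36, 74, 19, 9],     # 1958
         [16, 43, 70, 15, 6]]     # 1960
g2, x2, df = homogeneity_test(trout)

# chi-square(4) upper tail: exp(-x/2) * (1 + x/2)
p_value = math.exp(-x2 / 2) * (1 + x2 / 2)
print(round(g2, 2), round(x2, 2), df, round(p_value, 2))
```

As noted in the example, neither statistic uses the ordering of the weight categories, so a systematic shift towards lower weights is not reflected in the result.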


8.3.1.2 Comparing two proportions 


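A special case of the above is when $|\mathcal{X}|$ is equal to two, so that $\theta = (\theta_0, \theta_1) \in \mathbb{R}^2$
with $Y_{01} \sim \mathrm{binom}(n_0, \pi_0)$ and $Y_{11} \sim \mathrm{binom}(n_1, \pi_1)$.

Then the hypothesis of homogeneity may be formulated as $H_0 : \pi_0 = \pi_1$, so
we in effect are comparing two proportions. We shall then show that Pearson's $\chi^2$
statistic simplifies. To see this, we calculate $\mathrm{OBS}_{00} - \mathrm{EXP}_{00}$ as

$$\mathrm{OBS}_{00} - \mathrm{EXP}_{00} = Y_{00} - \frac{n_0 Y_{+0}}{n} = \frac{Y_{00} Y_{11} - Y_{01} Y_{10}}{n} = \frac{\det Y}{n}$$

where

$$Y = \begin{pmatrix} Y_{00} & Y_{01} \\ Y_{10} & Y_{11} \end{pmatrix}$$

is the $2 \times 2$ matrix forming the contingency table. We have used that

$$n = Y_{00} + Y_{10} + Y_{01} + Y_{11}.$$

We also have

$$\mathrm{OBS}_{00} - \mathrm{EXP}_{00} = \mathrm{EXP}_{01} - \mathrm{OBS}_{01} = \mathrm{OBS}_{11} - \mathrm{EXP}_{11} = \mathrm{EXP}_{10} - \mathrm{OBS}_{10}$$

and hence Pearson's $\chi^2$ simplifies to

$$X^2 = \frac{(\det Y)^2}{n^2} \Big( \frac{1}{\mathrm{EXP}_{00}} + \frac{1}{\mathrm{EXP}_{01}} + \frac{1}{\mathrm{EXP}_{10}} + \frac{1}{\mathrm{EXP}_{11}} \Big).$$

Note that this also means that there is only a single standardized residual to consider,

$$R^* = (\mathrm{OBS}_{00} - \mathrm{EXP}_{00}) \sqrt{ \frac{1}{\mathrm{EXP}_{00}} + \frac{1}{\mathrm{EXP}_{01}} + \frac{1}{\mathrm{EXP}_{10}} + \frac{1}{\mathrm{EXP}_{11}} },$$

as all four residuals are equivalent.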

When comparing proportions $\pi_0$ and $\pi_1$ that are not identical, we may wish to
quantify exactly how and how much they differ. There are alternative ways of doing
so by choosing different meaningful parameter functions known as measures of
association. We shall briefly consider some of the most common of these, and show
how to construct their confidence intervals.


Difference of proportions  The simplest and most direct measure of association is
the difference of proportions which is

$$\phi_{\mathrm{diff}}(\pi_0, \pi_1) = \pi_1 - \pi_0 = -\phi_{\mathrm{diff}}(1 - \pi_0, 1 - \pi_1)$$

and we note that comparing probabilities of failure is equivalent to comparing
probabilities of success, as this only modifies the sign. We always have

$$-1 < \phi_{\mathrm{diff}} < 1$$

and $\phi_{\mathrm{diff}} = 0$ if and only if $\pi_1 = \pi_0$. Based on data $Y$ as above, we get a simple
$1 - \alpha$ Wald based confidence interval for the difference of proportions as

$$C^{1-\alpha}_{\mathrm{diff}}(Y) = \hat\pi_1 - \hat\pi_0 \pm z_{1-\alpha/2} \sqrt{ \frac{\hat\pi_1 (1 - \hat\pi_1)}{n_1} + \frac{\hat\pi_0 (1 - \hat\pi_0)}{n_0} }$$

where $\hat\pi_i = Y_{i1}/n_i$, $n_i = Y_{i0} + Y_{i1}$, and $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile in the
standard normal distribution.
Relative risk  A small difference of probabilities may still be important if both
probabilities are small. If, say, the probability of dying from a certain disease is
0.01 when given one treatment and 0.001 when given another treatment, the difference
may be considered negligible, but still one treatment must be considered strongly
preferable. So it is of interest to consider the relative risk, which is the ratio of success
probabilities rather than the difference

$$\phi_{rr}(\pi_0, \pi_1) = \frac{\pi_1}{\pi_0}.$$

Note that here it matters whether we are considering successes or failures and the
relation between the two is somewhat complex. In the disease example, the relative risk
is 10 when considering probabilities of death, whereas the relative risk for survival
is 0.99.

Since relative risks may be very small it is safest to construct a Wald interval for
the logarithm and then exponentiate. The delta method yields that, asymptotically,

$$\log \frac{\hat\pi_1}{\hat\pi_0} = \log \hat\pi_1 - \log \hat\pi_0 \sim N\Big( \log \frac{\pi_1}{\pi_0},\ \frac{1 - \pi_1}{\pi_1 n_1} + \frac{1 - \pi_0}{\pi_0 n_0} \Big)$$

and hence an approximate $1 - \alpha$ confidence interval for the relative risk is

$$C^{1-\alpha}_{rr}(Y) = \Big( \frac{\hat\pi_1}{\hat\pi_0} e^{-z_{1-\alpha/2} \hat s},\ \frac{\hat\pi_1}{\hat\pi_0} e^{z_{1-\alpha/2} \hat s} \Big)$$

where

$$\hat s^2 = \frac{1 - \hat\pi_1}{\hat\pi_1 n_1} + \frac{1 - \hat\pi_0}{\hat\pi_0 n_0} = \frac{Y_{10}}{Y_{11} n_1} + \frac{Y_{00}}{Y_{01} n_0}.$$

Odds-ratio  The relative risk is an asymmetric measure which may have its merits
but also its difficulties, due to the lack of symmetry between success and failure. The
odds for success is defined as the ratio of probabilities of success and failure, i.e.
$\pi/(1 - \pi)$ and this suggests comparing the probabilities by forming their odds-ratio

$$\phi_{or}(\pi_1, \pi_0) = \frac{\pi_1/(1 - \pi_1)}{\pi_0/(1 - \pi_0)} = \frac{\pi_1 (1 - \pi_0)}{\pi_0 (1 - \pi_1)} = \frac{1}{\phi_{or}(1 - \pi_1, 1 - \pi_0)}.$$

The MLE for the odds-ratio is the cross-product ratio

$$\hat\phi_{or} = \frac{Y_{11} Y_{00}}{Y_{10} Y_{01}}.$$

Also here, the odds-ratio is best considered on the log-scale so we let

$$\lambda_{10} = \log \phi_{or} = \log \frac{\pi_1/(1 - \pi_1)}{\pi_0/(1 - \pi_0)} = \theta_1 - \theta_0$$

where $\theta_1$ and $\theta_0$ are the canonical parameters in the two associated Bernoulli
exponential families, see Example 3.19. Thus the asymptotic variance of the MLE of the
log odds-ratio $\lambda_{10}$ is

$$V(\hat\lambda_{10}) = \frac{1}{n_1 \pi_1 (1 - \pi_1)} + \frac{1}{n_0 \pi_0 (1 - \pi_0)} = \frac{1}{n_1 \pi_1} + \frac{1}{n_1 (1 - \pi_1)} + \frac{1}{n_0 \pi_0} + \frac{1}{n_0 (1 - \pi_0)}$$

and the MLE for this variance is

$$\frac{1}{Y_{10}} + \frac{1}{Y_{11}} + \frac{1}{Y_{00}} + \frac{1}{Y_{01}}.$$

Thus we obtain an asymptotic $1 - \alpha$ Wald confidence interval for the log-odds ratio
$\lambda_{10}$ as

$$C^{1-\alpha}_{\log or}(Y) = \log \frac{Y_{11} Y_{00}}{Y_{10} Y_{01}} \pm z_{1-\alpha/2} \sqrt{ \frac{1}{Y_{10}} + \frac{1}{Y_{11}} + \frac{1}{Y_{00}} + \frac{1}{Y_{01}} }. \tag{8.9}$$

We conclude the section devoted to comparing binomial proportions with an
example.

Table 8.3. Myocardial infarctions after treatment with aspirin or placebo. Source:
Agresti (2002) based on a study published in the Lancet, 338, 1345-1349 (1991).


           Myocardial infarction
Group      No      Yes     Total
Placebo    656     28      684
Aspirin    658     18      676
Total      1314    46      1360


Example 8.6. [Swedish aspirin study] A study performed in Sweden was made to
investigate the effect of aspirin use on myocardial infarction. A total of 1360 patients
who all previously had suffered a stroke were randomly assigned to treatment with 
aspirin or placebo. Table 8.3 displays the number of patients who had or had not 
suffered myocardial infarction after a three-year follow-up period. 

Pearson’s $\chi^2$ statistic evaluates to $X^2 = 2.13$, yielding a p-value of 0.1444 when
compared to a $\chi^2$ distribution with one degree of freedom, so there is no strong evidence
for the effect of aspirin with a sample of this size.

The 95% Wald confidence interval for the log-odds ratio $\lambda_{10}$ as given in (8.9)
evaluates to $(-0.16, 1.05)$. Since 0 is included in this interval, we reach the same
conclusion. The difference of proportions is not large:

\[
\hat\pi_1 - \hat\pi_0 = \frac{18}{676} - \frac{28}{684} = -0.014
\]

as the probabilities of myocardial infarction are small themselves, whereas the relative
risk

\[
\frac{\hat\pi_1}{\hat\pi_0} = \frac{18}{676}\cdot\frac{684}{28} = 0.65
\]

is considerably reduced in the aspirin group (although not significantly!). The reader
may calculate a confidence interval for the relative risk as an exercise.
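
The numbers in this example can be checked with a few lines of Python. The sketch below
is an illustration under stated assumptions: the helper name `pearson_2x2` and the
variable names are made up here, and the p-value uses the fact that a $\chi^2$ variable
with one degree of freedom is the square of a standard normal variable.

```python
import math
from statistics import NormalDist

# Table 8.3: rows are groups, columns are (no infarction, infarction).
placebo = (656, 28)
aspirin = (658, 18)


def pearson_2x2(row0, row1):
    """Pearson's X^2 for a 2 x 2 table with expected counts n_i * Y_{+j} / n."""
    n = sum(row0) + sum(row1)
    col = [row0[j] + row1[j] for j in range(2)]
    x2 = 0.0
    for row in (row0, row1):
        for j in range(2):
            expected = sum(row) * col[j] / n
            x2 += (row[j] - expected) ** 2 / expected
    return x2


x2 = pearson_2x2(placebo, aspirin)
# One degree of freedom: P(chi2_1 > x) = 2 * (1 - Phi(sqrt(x))).
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(x2)))

# Log-odds ratio of infarction with aspirin as group 1 and placebo as group 0.
log_or = math.log(aspirin[1] * placebo[0] / (aspirin[0] * placebo[1]))
se = math.sqrt(sum(1 / y for y in placebo + aspirin))
z = NormalDist().inv_cdf(0.975)
ci = (log_or - z * se, log_or + z * se)
```

With aspirin taken as group 1 the interval comes out as roughly $(-1.05, 0.16)$; swapping
the roles of the two groups changes the sign of the log-odds ratio and mirrors the
interval around 0. In either orientation 0 lies inside the interval.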


8.3.1.3 Comparing several multinomial distributions 


The considerations above generalize readily to the case of comparing $r$ multinomial
distributions. Indeed, if we let

\[
X_{01},\dots,X_{0n_0};\ \dots;\ X_{(r-1)1},\dots,X_{(r-1)n_{r-1}}
\]

denote random variables with values in $\mathcal{X}$, and let

\[
Y_{ij} = \sum_{m=1}^{n_i} 1_{\{j\}}(X_{im}) = \mathrm{OBS}_{ij}, \quad i=0,\dots,r-1;\ j=0,\dots,s-1
\]

be the total number of observations in category $j$ for group $i$, and

\[
Y_{+j} = \sum_{i=0}^{r-1} Y_{ij}
\]

the total number of observations in category $j$, we obtain the likelihood ratio statistic
for the hypothesis of homogeneity

\[
H_0 : \pi_0 = \cdots = \pi_{r-1}
\]

to be

\[
G^2 = 2\sum_{i=0}^{r-1}\sum_{j=0}^{s-1} Y_{ij}\log\frac{n\,Y_{ij}}{n_i Y_{+j}}
= 2\sum_{i=0}^{r-1}\sum_{j=0}^{s-1} \mathrm{OBS}_{ij}\log\frac{\mathrm{OBS}_{ij}}{\mathrm{EXP}_{ij}}
\]

where as before the expected number of objects in category $j$ for group $i$ under the
hypothesis is

\[
\mathrm{EXP}_{ij} = n_i\hat\pi_j = n_i\frac{Y_{+j}}{n}.
\]

The log-likelihood ratio statistic now has an asymptotic $\chi^2$ distribution with degrees
of freedom

\[
\text{degrees of freedom} = r(s-1) - (s-1) = (r-1)(s-1)
\]

under $H_0$ for $n \to \infty$ by Theorem 5.39; as does the corresponding Wald statistic
(Corollary 5.38), leading also here to Pearson’s $\chi^2$ statistic

\[
W_n = X^2 = \sum_{i=0}^{r-1}\sum_{j=0}^{s-1}
\frac{(\mathrm{OBS}_{ij} - \mathrm{EXP}_{ij})^2}{\mathrm{EXP}_{ij}}.
\]

Traditionally, such data are presented in an $r \times s$ contingency table which has the
form in Table 8.4. The cells of the table are the entries $(i, j)$ and the cell counts are the
random variables $Y_{ij}$.

Table 8.4: An $r \times s$ contingency table for the comparison of $r$ multinomial distributions
with $s$ categories. Random variables $Y_{ij}$ are the number among $n_{i+}$ objects with response
$j$ in group $i$.

                          Response
Group    0             1             ...   s-1               Total
0        $Y_{00}$      $Y_{01}$      ...   $Y_{0,s-1}$       $n_{0+}$
1        $Y_{10}$      $Y_{11}$      ...   $Y_{1,s-1}$       $n_{1+}$
...
r-1      $Y_{r-1,0}$   $Y_{r-1,1}$   ...   $Y_{r-1,s-1}$     $n_{r-1,+}$
Total    $Y_{+0}$      $Y_{+1}$      ...   $Y_{+,s-1}$       $n$


8.3.2 Independence of classification criteria 


Next, let us consider objects $X_1,\dots,X_n$ classified according to two criteria, i.e. with
values in $\mathcal{X} = \mathcal{I}\times\mathcal{J}$ where $\mathcal{I} = \{0,\dots,r-1\}$ and
$\mathcal{J} = \{0,\dots,s-1\}$, resulting in a contingency table as before, just that now also
the row totals $n_{i+} = Y_{i+}$ are random.

Without further restrictions, this corresponds to an unrestricted multinomial exponential
family as considered in Section 8.1.1, just with the state space coded differently so that
the category $(i, j) = (0, 0)$ is coded as 0; in other words, the model corresponds to an
exponential family with

\[
\theta_{ij} = \log\frac{\pi_{ij}}{\pi_{00}}, \quad \pi_{ij} = P\{X_m = (i, j)\},
\]

with the canonical parameters satisfying $\theta_{00} = 0$ and $\theta_{ij} \in \mathbb{R}$ for
$(i, j) \neq (0, 0)$.

We are interested in investigating whether the categories in the cross-classification
are independent or, in other words, whether the coordinate projections
$I\colon \mathcal{X}\to\mathcal{I}$ and $J\colon \mathcal{X}\to\mathcal{J}$ are independent random
variables. Expressed in terms of $\pi$, our hypothesis of independence is

\[
H_0 : \pi_{ij} = \pi_{i+}\pi_{+j}
\]

or, expressed in terms of the canonical parameters above:

\[
H_0 : \theta_{ij} = \alpha_i + \beta_j \tag{8.10}
\]

where $\alpha_0 = \beta_0 = 0$ and $\alpha_i, \beta_j \in \mathbb{R}$ for $i = 1,\dots,r-1$ and
$j = 1,\dots,s-1$, since under the hypothesis of independence we have

\[
\theta_{ij} = \log\frac{\pi_{ij}}{\pi_{00}} = \log\frac{\pi_{i+}\pi_{+j}}{\pi_{0+}\pi_{+0}}
= \alpha_i + \beta_j
\]

where

\[
\alpha_i = \log\frac{\pi_{i+}}{\pi_{0+}}, \quad \beta_j = \log\frac{\pi_{+j}}{\pi_{+0}}.
\]

The hypothesis in (8.10) defines a linear subspace of the $rs-1$ dimensional canonical
parameter space of the unrestricted model with dimension $r+s-2$; hence, by
Theorem 3.14, this is again a regular exponential family and thus (see Example 3.27)
a curved multinomial model as discussed in Section 8.1.2. Hence, the associated
likelihood ratio and Pearson’s $\chi^2$ statistics are those given in (8.6) and (8.7) and become

\[
G^2 = 2\sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\mathrm{OBS}_{ij}\log\frac{\mathrm{OBS}_{ij}}{\mathrm{EXP}_{ij}},
\quad
X^2 = \sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\frac{(\mathrm{OBS}_{ij}-\mathrm{EXP}_{ij})^2}{\mathrm{EXP}_{ij}}.
\]

Both of these statistics are asymptotically $\chi^2$ distributed with degrees of freedom
determined by the difference of the dimensions of the models

\[
\text{degrees of freedom} = rs - 1 - (r+s-2) = (r-1)(s-1).
\]


We need to find the expected cell counts $\mathrm{EXP}_{ij}$ under the hypothesis of
independence. The log-likelihood function under the hypothesis is

\[
\begin{aligned}
\ell(\pi) &= \sum_{i=0}^{r-1}\sum_{j=0}^{s-1} Y_{ij}\log\pi_{ij}
= \sum_{i=0}^{r-1}\sum_{j=0}^{s-1} Y_{ij}\log(\pi_{i+}\pi_{+j}) \\
&= \sum_{i=0}^{r-1}\sum_{j=0}^{s-1} Y_{ij}\log\pi_{i+}
+ \sum_{i=0}^{r-1}\sum_{j=0}^{s-1} Y_{ij}\log\pi_{+j} \\
&= \sum_{i=0}^{r-1} Y_{i+}\log\pi_{i+} + \sum_{j=0}^{s-1} Y_{+j}\log\pi_{+j}.
\end{aligned}
\]

Thus the log-likelihood function is a sum of two log-likelihood functions, each depending
on their own separate parameters and each of the same form as the standard
multinomial likelihood for classifying objects separately into the groups $\mathcal{I}$ and
$\mathcal{J}$. It follows that the MLE under the hypothesis is given as

\[
\hat\pi_{i+} = \frac{Y_{i+}}{n}, \quad \hat\pi_{+j} = \frac{Y_{+j}}{n}
\]

and thus the expected cell counts are

\[
\mathrm{EXP}_{ij} = n\hat\pi_{i+}\hat\pi_{+j} = \frac{Y_{i+}Y_{+j}}{n}.
\]

Note that under the hypothesis of homogeneity, the expected cell counts were equal
to $\mathrm{EXP}_{ij} = n_i Y_{+j}/n$; but since in that case we have $Y_{i+} = n_i$, the estimated
cell counts are identical to those under the hypothesis of independence. It follows
that also the test statistics are identical, and we note that so are their asymptotic $\chi^2$
distributions. The exact distributions of the test statistics differ, as the $Y_{i+}$ are random
in the independence model, but fixed in the model of homogeneity.

Table 8.5: A $4 \times 4$ contingency table for the cross-classification of 592 students at the
University of Delaware according to the colour of their eyes and their hair, collected by
Snee (1974).

                        Eye colour
Hair colour   Brown   Blue   Hazel   Green   Total
Black            68     20      15       5     108
Brown           119     84      54      29     286
Red              26     17      14      14      71
Blond             7     94      10      16     127
Total           220    215      93      64     592


Table 8.6: Standardized Pearson’s residuals for the cross-classification according to
hair and eye colour in Table 8.5. Residuals larger than 3 are highlighted by bold type.

                        Eye colour
Hair colour   Brown    Blue   Hazel   Green
Black          6.14   -4.25   -0.58   -2.29
Brown          2.16   -3.40    2.05   -0.51
Red           -0.10   -2.31    0.99    2.58
Blond         -8.33    9.97   -2.74    0.73


Example 8.7. [Hair and eye colour] To illustrate the developments above, we consider
a cross-classification of 592 students at the University of Delaware according
to their hair colour and eye colour. The data are displayed in Table 8.5.

We are interested in investigating whether these characteristics of individuals are
independent or related. The Pearson $\chi^2$ statistic evaluates to $X^2 = 138.3$ which
yields an asymptotic p-value that is numerically 0 when evaluated in a $\chi^2$ distribution
with degrees of freedom being $(4-1)(4-1) = 9$. Thus the characteristics are not independent.
The likelihood ratio statistic becomes $G^2 = 146.4$ leading to the same conclusion.

To understand more about the nature of deviations from independence, we investigate
the standardized residuals, displayed in Table 8.6. It is apparent that a main
feature of these data is that very few students with black or brown hair have blue
eyes, whereas blue eyes are very common for students with blond hair.
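
The two test statistics and the residuals for Table 8.5 can be recomputed as in the
following Python sketch. The function name `independence_test` is an assumption made for
this illustration. The standardized residual is taken here to be
$(\mathrm{OBS}_{ij}-\mathrm{EXP}_{ij})/\sqrt{\mathrm{EXP}_{ij}(1-Y_{i+}/n)(1-Y_{+j}/n)}$; this
definition is not stated in the present section and is assumed because it matches the
entries of Table 8.6.

```python
import math

# Table 8.5: rows are hair colour (Black, Brown, Red, Blond),
# columns are eye colour (Brown, Blue, Hazel, Green).
counts = [
    [68, 20, 15, 5],
    [119, 84, 54, 29],
    [26, 17, 14, 14],
    [7, 94, 10, 16],
]


def independence_test(table):
    """Return (X2, G2, degrees of freedom, standardized residuals)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    x2 = g2 = 0.0
    resid = []
    for i, r in enumerate(table):
        resid.append([])
        for j, obs in enumerate(r):
            exp = row[i] * col[j] / n             # EXP_ij = Y_i+ * Y_+j / n
            x2 += (obs - exp) ** 2 / exp
            if obs > 0:                           # 0 * log 0 is taken as 0
                g2 += 2 * obs * math.log(obs / exp)
            # Assumed definition of the standardized Pearson residual.
            sd = math.sqrt(exp * (1 - row[i] / n) * (1 - col[j] / n))
            resid[i].append((obs - exp) / sd)
    df = (len(row) - 1) * (len(col) - 1)
    return x2, g2, df, resid


x2, g2, df, resid = independence_test(counts)
```

The same function applies unchanged to the test of homogeneity, since the expected
counts and hence both statistics coincide in the two settings.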


8.3.3 Poisson models for contingency tables 


For certain types of cross-classified data, it is not reasonable to assume that the total
number $n$ of classified objects is fixed. This holds, for example, when the entries in the
table represent the number of events (errors, accidents, deaths, diseased individuals)
in a specified period or region, classified according to characteristics
$\mathcal{I} = \{1,\dots,|\mathcal{I}|\}$ and $\mathcal{J} = \{1,\dots,|\mathcal{J}|\}$. In such cases,
the total number of events $n$ is not controlled.

As in the previous subsection, data may still usefully be presented in a contingency
table, just with the modification that the entry in the lower right corner of
that table, containing the total number of objects, may now usefully be denoted $Y_{++}$
to emphasize its random nature.

A generic model for data of this type is that the entries in the contingency table
are independent and Poisson distributed with

\[
P_\lambda(Y_{ij} = y_{ij}) = \frac{\lambda_{ij}^{y_{ij}}}{y_{ij}!}\, e^{-\lambda_{ij}},
\quad i=0,\dots,r-1;\ j=0,\dots,s-1.
\]

This corresponds to the outer product construction, described in Section 3.4, of
individual simple Poisson models, and hence this family of distributions is a regular
and minimally represented exponential family with canonical parameter
$\theta \in \Theta = \mathbb{R}^{rs}$ where

\[
\theta_{ij} = \log\lambda_{ij}, \quad i=0,\dots,r-1;\ j=0,\dots,s-1
\]

and the table of counts as the canonical statistic. We shall refer to this model as the
unrestricted Poisson model. In this exponential family, the MLE is simply equal to
the observed table of counts

\[
\hat\lambda_{ij} = y_{ij}, \quad i=0,\dots,r-1;\ j=0,\dots,s-1. \tag{8.11}
\]

In the following we shall investigate hypotheses and models obtained by specifying
various restrictions on the parameters.


8.3.3.1 The simple multiplicative Poisson model


The first submodel we consider is the simple multiplicative Poisson model, determined
by restricting the expectation to have the multiplicative form

\[
\lambda_{ij} = \mu\rho_i\eta_j, \quad i=0,\dots,r-1;\ j=0,\dots,s-1 \tag{8.12}
\]

for some $\mu \in \mathbb{R}_+$, $\rho \in \mathbb{R}_+^r$ and $\eta \in \mathbb{R}_+^s$ with
$\rho_0 = \eta_0 = 1$ or, equivalently, restricting the canonical parameter to have the
additive form

\[
\theta_{ij} = \log\lambda_{ij} = \gamma + \alpha_i + \beta_j, \quad i=0,\dots,r-1;\ j=0,\dots,s-1
\]

for some $\gamma \in \mathbb{R}$, $\alpha \in \mathbb{R}^r$, and $\beta \in \mathbb{R}^s$ with
$\alpha_0 = \beta_0 = 0$.

This model was developed and investigated by the Danish statistician Georg
Rasch (1901-1980) in Rasch (1960). It specifies that $\theta = \log\lambda \in L$ where $L$ is
a linear subspace of $\mathbb{R}^{rs}$ of dimension $r+s-1$ and is known as a log-linear model.

By Theorem 3.14, the multiplicative Poisson model is a regular exponential
model and a curved (linear) submodel of the unrestricted Poisson model. Indeed
we may write the joint density of the model as

\[
\begin{aligned}
f_{(\gamma,\alpha,\beta)}(Y = y)
&= \prod_{i=0}^{r-1}\prod_{j=0}^{s-1} e^{(\gamma+\alpha_i+\beta_j)y_{ij} - e^{\gamma+\alpha_i+\beta_j}} \\
&= \exp\Bigl(\gamma y_{++} + \sum_{i=1}^{r-1}\alpha_i y_{i+} + \sum_{j=1}^{s-1}\beta_j y_{+j}
- \psi(\gamma,\alpha,\beta)\Bigr)
\end{aligned}
\]

with respect to the discrete base measure $\nu$

\[
\nu(y) = \prod_{i=0}^{r-1}\prod_{j=0}^{s-1}\frac{1}{y_{ij}!},
\]

identifying the canonical statistic as

\[
t(y)^\top = (y_{++}, y_{1+},\dots,y_{r-1,+}, y_{+1},\dots,y_{+,s-1})
\]

and the cumulant function

\[
\psi(\gamma,\alpha,\beta) = \sum_{i=0}^{r-1}\sum_{j=0}^{s-1} e^{\gamma+\alpha_i+\beta_j}
= \sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\lambda_{ij}.
\]

We note that the canonical statistic is in one-to-one linear correspondence with the
marginal totals of the contingency table

\[
\tilde t(y) = (y_{0+},\dots,y_{r-1,+}, y_{+0},\dots,y_{+,s-1})
\]

since we have

\[
y_{++} = y_{0+} + \cdots + y_{r-1,+} = y_{+0} + \cdots + y_{+,s-1}
\]

and

\[
y_{0+} = y_{++} - y_{1+} - \cdots - y_{r-1,+}, \quad
y_{+0} = y_{++} - y_{+1} - \cdots - y_{+,s-1}.
\]

We thus obtain the MLE in this model by equating the observed canonical statistic to
its expectation or, equivalently, the MLE is the unique point in the model that satisfies
the equations

\[
\hat\lambda_{i+} = y_{i+},\ i=0,\dots,r-1; \quad \hat\lambda_{+j} = y_{+j},\ j=0,\dots,s-1
\]
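where we have exploited the linear relationship between the canonical statistic $t$ and
the marginal totals $\tilde t$. We next note that if we let $\hat\lambda$ be

\[
\hat\lambda_{ij} = \frac{y_{i+}\,y_{+j}}{y_{++}},
\]

these equations are indeed satisfied since adding the numerators over either of the
indices yields the relevant marginal totals. Since the unrestricted MLE in (8.11) has
total

\[
\sum_{i=0}^{r-1}\sum_{j=0}^{s-1} y_{ij} = y_{++}
\]

and also

\[
\sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\hat\lambda_{ij}
= \sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\frac{y_{i+}\,y_{+j}}{y_{++}} = y_{++},
\]

the terms involving the sums of the estimated means cancel and the log-likelihood
ratio statistic becomes

\[
\begin{aligned}
G^2 &= 2\sum_{i=0}^{r-1}\sum_{j=0}^{s-1} y_{ij}\log\frac{y_{ij}}{\hat\lambda_{ij}}
- 2\sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\bigl(y_{ij} - \hat\lambda_{ij}\bigr) \\
&= 2\sum_{i=0}^{r-1}\sum_{j=0}^{s-1} y_{ij}\log\frac{y_{ij}}{\hat\lambda_{ij}}
= 2\sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\mathrm{OBS}_{ij}\log\frac{\mathrm{OBS}_{ij}}{\mathrm{EXP}_{ij}}
\end{aligned}
\]

where again $\mathrm{OBS}_{ij}$ and $\mathrm{EXP}_{ij}$ are the observed and expected cell counts

\[
\mathrm{OBS}_{ij} = y_{ij}, \quad \mathrm{EXP}_{ij} = \hat\lambda_{ij} = \frac{y_{i+}\,y_{+j}}{y_{++}}.
\]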


Since in this model individual cell counts are independent and the Fisher information
in the simple Poisson model is $i(\lambda) = \lambda^{-1}$ (derived in Example 1.26), we get
the Wald statistic

\[
X^2 = W = \sum_{i=0}^{r-1}\sum_{j=0}^{s-1} i(\hat\lambda_{ij})\bigl(y_{ij} - \hat\lambda_{ij}\bigr)^2
= \sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\frac{(\mathrm{OBS}_{ij} - \mathrm{EXP}_{ij})^2}{\mathrm{EXP}_{ij}}.
\]

We note again that both the log-likelihood ratio and Wald test statistics in this
model are the same functions of the cell counts as in the case of testing for homogeneity
of multinomial distributions or independence of cross-classifications. And, as
argued in Example 5.41, we may use the standard asymptotic results if just all of the
$\lambda_{ij}$ are large, so both of these statistics are asymptotically $\chi^2$ distributed with
degrees of freedom determined by the difference of the model dimensions:

\[
\text{degrees of freedom} = rs - (r+s-1) = (r-1)(s-1)
\]

which is also exactly the same as we found in the homogeneity and independence
cases. Again we note that the asymptotic distributions of the test statistics are
identical, in contrast to the exact distributions, as these involve different distributional
assumptions for the sample marginals.


Example 8.8. [Incidence of Covid-19] Table 8.7 shows the number of confirmed
cases of Covid-19 infection in four local Danish communities in the period 4-10
October 2020. Copenhagen West combines data from Ishøj, Brøndby, and Rødovre
municipalities and Copenhagen City combines data from Copenhagen and Frederiksberg
municipalities. We shall investigate whether this table may be well described by
a multiplicative Poisson model.

The log-likelihood ratio statistic can be calculated to $G^2 = 151.5$ and Pearson’s
$\chi^2$ evaluates to $X^2 = 139.7$. Since the degrees of freedom for the asymptotic $\chi^2$
distribution is $(4-1)(8-1) = 21$, this yields an associated p-value which is numerically
0, thus in effect ruling out the multiplicative Poisson model as a reasonable description
of data, even though some of the entries in the table are quite small, thus casting
some doubt on the validity of the $\chi^2$-approximation. A Monte Carlo-based method,
to be discussed in Section 8.3.4.3 below, yields $p = 0.0005$, also effectively ruling
out the multiplicative Poisson model as a good description of the table.

Table 8.7: Number of confirmed infections in the period 4-10 October 2020 with
Covid-19 in four Danish local communities. Source: Statens Serum Institut.

                                    Age group
Area              0-9  10-19  20-29  30-39  40-49  50-59  60-69  70+  Total
Copenhagen West     7     42      6     11     16     18     14    9    123
Aarhus             11     74    103     33     28     26     13    7    295
Copenhagen City    36     51    153     70     64     36     16   11    437
Slagelse            3     14      2      2     13     14      3    1     52
Total              57    181    264    116    121     94     46   28    907


Table 8.8: Standardized Pearson residuals for a multiplicative Poisson model applied
to the Covid-19 data in Table 8.7. Residuals larger than 3 are in bold type.

                                    Age group
Area                0-9  10-19  20-29  30-39  40-49  50-59  60-69    70+
Copenhagen West   -0.29   4.24  -6.36  -1.37  -0.12   1.67   3.43   2.92
Aarhus            -2.20   2.68   2.67  -1.00  -2.37  -1.06  -0.63  -0.86
Copenhagen City    2.34  -6.02   3.77   2.81   1.11  -2.03  -1.87  -0.96
Slagelse          -0.16   1.29  -4.13  -1.99   2.55   4.04   0.24  -0.50

To investigate the deviations further, we consider the standardized Pearson residuals,
as displayed in Table 8.8. Several of these residuals are far too large; for example,
there are far too many incidences in age groups 10-19 and 60-69 in
Copenhagen West (42 and 14) compared to what would be expected (24.55 and 6.24)
and far too few in the age group 20-29 in Copenhagen West (six observed cases and
35.8 expected). Also, the number of observed cases in Copenhagen City in the age
group 20-29 is far too large (153 observed and 127.20 expected).
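
The expected counts under the simple multiplicative Poisson model are just products of
marginal totals divided by the grand total, so they are easy to recompute from Table 8.7.
The following Python sketch does so; the variable names are assumptions made for this
illustration.

```python
# Table 8.7: rows are areas, columns are age groups 0-9, 10-19, ..., 70+.
areas = ["Copenhagen West", "Aarhus", "Copenhagen City", "Slagelse"]
cases = [
    [7, 42, 6, 11, 16, 18, 14, 9],
    [11, 74, 103, 33, 28, 26, 13, 7],
    [36, 51, 153, 70, 64, 36, 16, 11],
    [3, 14, 2, 2, 13, 14, 3, 1],
]

row = [sum(r) for r in cases]              # y_i+
col = [sum(c) for c in zip(*cases)]        # y_+j
total = sum(row)                           # y_++

# MLE of the means in the simple multiplicative model: y_i+ * y_+j / y_++.
expected = [[ri * cj / total for cj in col] for ri in row]

cw_10_19 = expected[0][1]   # Copenhagen West, age group 10-19
cw_20_29 = expected[0][2]   # Copenhagen West, age group 20-29
cw_60_69 = expected[0][6]   # Copenhagen West, age group 60-69
cc_20_29 = expected[2][2]   # Copenhagen City, age group 20-29
```

Comparing `cases[i][j]` with `expected[i][j]` cell by cell shows directly where the
observed counts depart most from the multiplicative structure.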

There are many possible explanations of these deviations from the multiplicative
Poisson model, but it is clear that the phenomenon of Covid-19 is more complex than
the multiplicative Poisson model is able to accommodate. One notable omission in
the analysis is that the age distributions in the four areas could be quite different,
potentially explaining some of the inhomogeneities in the Covid-19 incidence. The
population sizes in the various age groups are displayed in Table 8.9 and we note
that whereas the distributions over age groups in Copenhagen West and Slagelse are
almost uniform, Copenhagen City and Aarhus both have a much larger population
in the age group 20-29 than Slagelse and Copenhagen West, which might explain
the relatively large number of infected individuals in these groups. Below we shall
discuss models that are able to incorporate structural information of this type.

Table 8.9: Population size in local areas in Denmark by age group. Source: Danmarks
Statistik.

                                           Age group
Area                 0-9   10-19    20-29    30-39   40-49   50-59   60-69     70+
Copenhagen West   12,025  11,459   12,126   13,239  13,179  13,226  10,532  13,192
Aarhus            37,238  34,360   86,348   46,231  38,009  38,180  32,370  36,697
Copenhagen City   80,772  61,161  172,299  135,122  96,076  78,365  54,249  59,109
Slagelse           7,642   9,132   10,069    8,342  10,042  11,482   9,765  12,681


8.3.3.2 The shifted multiplicative Poisson model 


We shall consider a modification of the multiplicative Poisson model that takes a
background risk into account. More precisely, we assume that the mean number of
events in category $ij$ has the form

\[
\lambda_{ij} = \mu\rho_i\eta_j B_{ij}, \quad i=0,\dots,r-1;\ j=0,\dots,s-1
\]

where

\[
B_{ij} > 0, \quad i=0,\dots,r-1;\ j=0,\dots,s-1
\]

is a table of known positive numbers, meant to indicate the magnitude of known
factors that affect the number of events in groups $i$ and $j$, with
$\mu, \rho_i, \eta_j \in \mathbb{R}_+$ and $\rho_0 = \eta_0 = 1$ as before. Models of this type are
commonly used for pricing in non-life insurance, where $B_{ij}$, for example, is the number
of customers in risk group $ij$ and $Y_{ij}$ the number of claims from customers in that
risk group in a specific time period. Rewriting the model in terms of canonical parameters
$\theta_{ij} = \log\lambda_{ij}$, we get

\[
\theta_{ij} = \gamma + \alpha_i + \beta_j + \log B_{ij}
\]

so this is an affine subfamily in contrast to the simple multiplicative model considered
above which defined a linear subfamily. We shall refer to this model as the shifted
multiplicative Poisson model. Theorem 3.14 implies that this is a regular exponential
model and an affine submodel of the unrestricted Poisson model.

As for the simple multiplicative model, we may write the joint density as

\[
f_{(\gamma,\alpha,\beta)}(Y = y) = \prod_{i=0}^{r-1}\prod_{j=0}^{s-1}
e^{(\gamma+\alpha_i+\beta_j+\log B_{ij})y_{ij} - e^{\gamma+\alpha_i+\beta_j+\log B_{ij}}}
\]

which, after absorbing the factors $e^{y_{ij}\log B_{ij}} = B_{ij}^{y_{ij}}$ into the base measure,
takes the form

\[
\exp\Bigl(\gamma y_{++} + \sum_{i=1}^{r-1}\alpha_i y_{i+} + \sum_{j=1}^{s-1}\beta_j y_{+j}
- \psi(\gamma,\alpha,\beta)\Bigr)
\]

with respect to the discrete base measure $\nu$

\[
\nu(y) = \prod_{i=0}^{r-1}\prod_{j=0}^{s-1}\frac{B_{ij}^{y_{ij}}}{y_{ij}!},
\]

identifying the canonical statistic as

\[
t(y)^\top = (y_{++}, y_{1+},\dots,y_{r-1,+}, y_{+1},\dots,y_{+,s-1})
\]

and the cumulant function

\[
\psi(\gamma,\alpha,\beta) = \sum_{i=0}^{r-1}\sum_{j=0}^{s-1} e^{\gamma+\alpha_i+\beta_j+\log B_{ij}}
= \sum_{i=0}^{r-1}\sum_{j=0}^{s-1}\lambda_{ij}.
\]

Note that the affine component $\log B_{ij}$ changes the base measure and the cumulant
function, but not the sufficient statistics which also here are

\[
t(y)^\top = (y_{++}, y_{1+},\dots,y_{r-1,+}, y_{+1},\dots,y_{+,s-1}).
\]

We thus obtain the MLE in the shifted Poisson model by equating the observed
canonical statistic to its expectation or, equivalently, the MLE is the unique solution
to the equations

\[
\hat\lambda_{i+} = y_{i+},\ i=0,\dots,r-1; \quad \hat\lambda_{+j} = y_{+j},\ j=0,\dots,s-1 \tag{8.13}
\]

provided it is well defined. We have again exploited the linear relationship between
the canonical statistic $t$ and the marginal totals $\tilde t$. Written out in terms of the
parameters, the equations (8.13) read

\[
\sum_{j=0}^{s-1}\mu\rho_i\eta_j B_{ij} = y_{i+},\ i=0,\dots,r-1; \quad
\sum_{i=0}^{r-1}\mu\rho_i\eta_j B_{ij} = y_{+j},\ j=0,\dots,s-1.
\]

In contrast to the simple multiplicative model, the likelihood equations cannot be
solved explicitly since $\lambda_{ij}$ has a more complex relation to the parameters of the
model. An exception is the case when $r = s = 2$, see Exercise 8.8.

However, the form of the equations suggests a simple iterative procedure for
solving them. The procedure is initiated at $\mu = y_{++}/B_{++}$ and $\rho_i = \eta_j = 1$ and
now updates the parameters as

\[
\mu \leftarrow \frac{y_{0+}}{\sum_{j=0}^{s-1}\eta_j B_{0j}}, \quad
\rho_i \leftarrow \frac{y_{i+}}{\mu\sum_{j=0}^{s-1}\eta_j B_{ij}} \tag{8.14}
\]

\[
\mu \leftarrow \frac{y_{+0}}{\sum_{i=0}^{r-1}\rho_i B_{i0}}, \quad
\eta_j \leftarrow \frac{y_{+j}}{\mu\sum_{i=0}^{r-1}\rho_i B_{ij}} \tag{8.15}
\]

using previously calculated values of the parameters on the right-hand side to update
the values on the left-hand side. The iteration may become a bit clearer when
expressed in terms of repeatedly updating the estimates for the means $\lambda_{ij}$ as:

\[
\lambda_{ij} \leftarrow \lambda_{ij}\frac{y_{i+}}{\lambda_{i+}},\ i=0,\dots,r-1; \quad
\lambda_{ij} \leftarrow \lambda_{ij}\frac{y_{+j}}{\lambda_{+j}},\ j=0,\dots,s-1. \tag{8.16}
\]

Indeed this procedure is known as iterative proportional scaling (IPS), since at each
step the table of expected values is scaled proportionally by the ratio of observed
to expected marginals. The algorithm can be shown to converge
to the MLE, which is well defined as long as all marginal totals are positive, see
Section B.3 for details.

To see that the updates in (8.16) are equivalent to those in (8.14) and (8.15), we
express the marginal scalings in terms of the model parameters so that, for example,
the row updates are

\[
\mu\rho_i\eta_j B_{ij} \leftarrow \mu\rho_i\eta_j B_{ij}\,
\frac{y_{i+}}{\sum_{k}\mu\rho_i\eta_k B_{ik}}
= \frac{y_{i+}}{\sum_{k}\eta_k B_{ik}}\,\eta_j B_{ij}.
\]

Thus we see that the factor $\mu\rho_i$ is updated by the ratio of the row total to a weighted
sum of the base risks $B_{ij}$, whereas other parameters are unchanged. For $i = 0$ we
get the first update in (8.14) and for $i > 0$ we get the second by division with $\mu$. The
calculations are analogous for the column updates.

Table 8.10: Standardized Pearson residuals for a shifted multiplicative Poisson model
applied to the Covid-19 data in Table 8.7. Residuals larger than 3 are in bold type.

                                    Age group
Area                0-9  10-19  20-29  30-39  40-49  50-59  60-69    70+
Copenhagen West   -0.74   2.40  -3.82  -0.95  -0.67   0.69   2.45   1.85
Aarhus            -2.00   2.41   0.94   0.29  -1.33  -0.71  -0.85  -0.97
Copenhagen City    2.41  -4.18   2.44   0.93   0.61  -1.32  -0.77   0.13
Slagelse          -0.02   0.21  -2.73  -1.36   2.24   2.80  -0.51  -1.12


Example 8.9. [Incidence of Covid-19] This is a continuation of Example 8.8, where
we had to abandon the multiplicative Poisson model, suspecting that ignoring the
different age distributions in the areas under investigation could have led to an
inadequate description of the situation.

To investigate whether this is a possible explanation of the deviation from the
Poisson model, we consider a shifted multiplicative model where the numbers $Y_{ij}$ of
confirmed Covid-19 infections in area $i$ and age group $j$ are assumed to be independent
and Poisson distributed with expectation $\lambda_{ij}$, where now

\[
\lambda_{ij} = \mu\rho_i\eta_j B_{ij}, \quad i=0,\dots,3;\ j=0,\dots,7
\]

with $\mu, \rho_i, \eta_j \in \mathbb{R}_+$ and $\rho_0 = \eta_0 = 1$ and $B_{ij}$ is the total number of
individuals living in area $i$ in the age group $j$.

This model is clearly an improvement since we find a likelihood ratio statistic of
$G^2 = 71.3$ and Pearson’s $\chi^2$ evaluates to $X^2 = 65.8$. However, both of these give
asymptotic p-values around $10^{-6}$ when compared to the asymptotic $\chi^2$ distribution
with 21 degrees of freedom as before, so we conclude that the shifted model is still
not an acceptable description of the data.

The standardized residuals are displayed in Table 8.10, and we note that the most
extreme residuals from the simple multiplicative Poisson analysis have disappeared
or have become less extreme. There are still extreme residuals in the age group 20-29
in Copenhagen West, where the number of infections is smaller than expected, and
the same holds for the age group 10-19 in Copenhagen City. The expected numbers
from the shifted multiplicative model are displayed in Table 8.11 and comparing
with the observed incidences in Table 8.7 we note that only six cases were observed
among 20-29 year olds in Copenhagen West, whereas 21 were expected, and only 51
cases were observed in Copenhagen City among 10-19 year olds, whereas 76 were
expected.

Table 8.11: Expected number of infected individuals based on a shifted multiplicative
Poisson model applied to the Covid-19 data in Table 8.7.

                                    Age group
Area                0-9  10-19   20-29  30-39  40-49  50-59  60-69    70+
Copenhagen West    8.95  31.39   21.21  14.15  18.45  15.67   7.93   5.26
Aarhus            17.79  60.39   96.92  31.70  34.14  29.03  15.64   9.38
Copenhagen City   27.22  75.85  136.46  65.38  60.89  42.04  18.50  10.66
Slagelse           3.04  13.37    9.41   4.76   7.51   7.27   3.93   2.70
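
The expected numbers in Table 8.11 can be recomputed by the scaling iteration (8.16)
applied to the counts in Table 8.7 and the population sizes in Table 8.9. The Python
sketch below is one possible implementation, not the author's: the function name `ips`
and the fixed number of sweeps are assumptions made here, and a careful implementation
would stop on a convergence criterion instead.

```python
# Observed cases (Table 8.7) and population sizes B_ij (Table 8.9).
cases = [
    [7, 42, 6, 11, 16, 18, 14, 9],
    [11, 74, 103, 33, 28, 26, 13, 7],
    [36, 51, 153, 70, 64, 36, 16, 11],
    [3, 14, 2, 2, 13, 14, 3, 1],
]
population = [
    [12025, 11459, 12126, 13239, 13179, 13226, 10532, 13192],
    [37238, 34360, 86348, 46231, 38009, 38180, 32370, 36697],
    [80772, 61161, 172299, 135122, 96076, 78365, 54249, 59109],
    [7642, 9132, 10069, 8342, 10042, 11482, 9765, 12681],
]


def ips(y, b, sweeps=200):
    """Iterative proportional scaling for lambda_ij = mu * rho_i * eta_j * B_ij."""
    r, s = len(y), len(y[0])
    row = [sum(y[i]) for i in range(r)]
    col = [sum(y[i][j] for i in range(r)) for j in range(s)]
    scale = sum(row) / sum(map(sum, b))           # start at mu = y_++ / B_++
    lam = [[scale * b[i][j] for j in range(s)] for i in range(r)]
    for _ in range(sweeps):                       # fixed sweep count (assumption)
        for i in range(r):                        # scale rows to match y_i+
            f = row[i] / sum(lam[i])
            lam[i] = [f * v for v in lam[i]]
        for j in range(s):                        # scale columns to match y_+j
            f = col[j] / sum(lam[i][j] for i in range(r))
            for i in range(r):
                lam[i][j] *= f
    return lam


fitted = ips(cases, population)
x2 = sum((cases[i][j] - fitted[i][j]) ** 2 / fitted[i][j]
         for i in range(4) for j in range(8))
```

Each scaling step multiplies a whole row or column by a constant, so the fitted table
keeps the form $\mu\rho_i\eta_j B_{ij}$ throughout; only the factors change.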


Still, there are too many residuals that are moderately large, as the formal test
also indicates. This may be due to the Poisson distribution not being adequate for
events of this type, as infections among individuals are not singular, random events,
but tend to appear in clusters of larger outbreaks as individuals infect each other.
This would typically lead to the variance being higher than predicted by the Poisson
distribution, a phenomenon known as over-dispersion, and thus the Pearson $\chi^2$
statistic becomes large. We would then estimate an over-dispersion factor $\sigma^2$ as
$\hat\sigma^2 = X^2/f = 65.8/21 = 3.13$, where $f = 21$ is the number of degrees of freedom.


For illustration we may calculate confidence intervals for the parameters in the
model, but since the Poisson variance is too small, we inflate the confidence intervals
by multiplying the estimated standard deviations of the canonical parameter estimates
by the square root of the over-dispersion factor, i.e. with $\sqrt{3.13} \approx 1.77$.


The inflated confidence intervals are displayed in Table 8.12. We note that all
areas have infection rates below those in Copenhagen West since 1 is outside the
confidence intervals for the rates associated with areas. Also, the factors related to
age groups 10-19 and 20-29 stand out as the highest among the age groups, but
caution should be taken when interpreting the confidence intervals in the light of the
shortcomings of the model that we have established.


Table 8.12: Inflated 95% confidence intervals for parameters $\mu, \rho, \eta$ in the shifted
multiplicative Poisson model as applied to the Covid-19 incidence data. The inflated
confidence interval for the base infection rate $\mu$, i.e. the infection rate in age group 0-9
in Copenhagen West, is 4.3-12.7 per ten thousand.

Area         Aarhus      Copenhagen City   Slagelse
             0.44-0.93   0.32-0.65         0.30-0.95

Age group    10-19     20-29     30-39      40-49     50-59      60-69    70+
             2.2-6.2   1.4-3.9   0.82-2.5   1.1-3.3   0.89-2.8   0.51-2   0.24-1.19

8.3.4 Sampling models for tables of counts


We have previously noted a striking similarity between the models for contingency
tables associated with homogeneity, with independence of cross-classifications, and
multiplicative Poisson models. Their main differences were associated with assumptions
on whether various totals were considered fixed or random.

In the multiplicative Poisson model, all entries were random and hence all totals
as well; the model for studying independence of cross-classifications had the grand
total $Y_{++} = n$ fixed, but other entries random; when considering homogeneity, the
row totals $Y_{i+} = n_i$ and hence the grand total were fixed.

Clearly, these models represent different ways in which the data have been collected,
but they also represent the choice of representation space used in the statistical
modelling. For example, in the study concerning myocardial infarction, the allocation
to aspirin vs. placebo was in fact made by randomization, but as the randomization
itself was not relevant for the problem, we chose to consider the marginal totals
fixed, rather than binomially distributed. This reflects that we consider this additional
variation irrelevant, so we prefer to avoid this complication.

Luckily we noticed that neither the test statistics nor their asymptotic distributions
depended on which of the three situations we consider. As we shall demonstrate below,
this is not a coincidence, as the models are directly related in a specific and simple way.


8.3.4.1 From multiplicative Poisson to independence 


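Recall that the Poisson model has density

\[
P_\lambda(Y = y) = \prod_{i=0}^{r-1}\prod_{j=0}^{s-1}\frac{\lambda_{ij}^{y_{ij}}}{y_{ij}!}\, e^{-\lambda_{ij}}.
\]

The marginal total $Y_{++}$ is a sum of independent Poisson distributed random variables
and is therefore Poisson distributed with parameter $\lambda_{++} = \sum_{ij}\lambda_{ij}$ and has
therefore density

\[
P_\lambda(Y_{++} = y_{++}) = \frac{\lambda_{++}^{y_{++}}}{y_{++}!}\, e^{-\lambda_{++}}.
\]

If we wish to consider the observations in a representation space where $Y_{++} = n$ is
fixed and thus ignore the possibly irrelevant information in the grand total, we may
calculate the conditional distribution which for any $y$ with $y_{++} = n$ is

\[
\begin{aligned}
P(Y = y \mid Y_{++} = n)
&= \frac{P_\lambda(Y = y \text{ and } Y_{++} = n)}{P_\lambda(Y_{++} = n)}
= \frac{\prod_{i=0}^{r-1}\prod_{j=0}^{s-1}\frac{\lambda_{ij}^{y_{ij}}}{y_{ij}!}\, e^{-\lambda_{ij}}}
{\frac{\lambda_{++}^{y_{++}}}{y_{++}!}\, e^{-\lambda_{++}}} \\
&= \frac{y_{++}!}{\prod_{i=0}^{r-1}\prod_{j=0}^{s-1} y_{ij}!}
\prod_{i=0}^{r-1}\prod_{j=0}^{s-1}\Bigl(\frac{\lambda_{ij}}{\lambda_{++}}\Bigr)^{y_{ij}}
= \binom{n}{y_{00},\dots,y_{r-1,s-1}}\prod_{i=0}^{r-1}\prod_{j=0}^{s-1}\pi_{ij}^{y_{ij}}
\end{aligned}
\]

where now $\pi_{ij} = \lambda_{ij}/\lambda_{++}$. Thus the unrestricted Poisson model becomes the
unrestricted multinomial model when considering the grand total fixed.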
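
The identity between the conditional Poisson distribution and the multinomial
distribution is easy to check numerically. The Python sketch below does this for one
table; the Poisson means and counts are hypothetical values chosen only for illustration,
and the helper name `poisson_pmf` is an assumption made here.

```python
import math


def poisson_pmf(y, lam):
    """Poisson point probability lam^y / y! * exp(-lam)."""
    return math.exp(-lam) * lam ** y / math.factorial(y)


# Hypothetical 2 x 2 table of Poisson means and one table of counts,
# chosen only to illustrate the identity.
lam = [[1.5, 0.7], [2.0, 3.1]]
y = [[2, 0], [1, 3]]

flat_lam = [v for r in lam for v in r]
flat_y = [v for r in y for v in r]
n = sum(flat_y)
lam_total = sum(flat_lam)

# Left-hand side: joint Poisson probability divided by P(Y_++ = n).
joint = math.prod(poisson_pmf(k, l) for k, l in zip(flat_y, flat_lam))
conditional = joint / poisson_pmf(n, lam_total)

# Right-hand side: multinomial probability with pi_ij = lambda_ij / lambda_++.
coef = math.factorial(n)
for k in flat_y:
    coef //= math.factorial(k)            # multinomial coefficient, exact integers
multinomial = coef * math.prod((l / lam_total) ** k
                               for k, l in zip(flat_y, flat_lam))
```

The two numbers `conditional` and `multinomial` agree up to floating-point rounding,
whatever positive means and non-negative counts are used.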

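In addition, if we assume that the parameter has multiplicative structure as in
(8.12), i.e. if $\lambda_{ij} = \mu\rho_i\eta_j$, we have

\[
\pi_{ij} = \frac{\mu\rho_i\eta_j}{\mu\rho_+\eta_+} = \frac{\rho_i}{\rho_+}\,\frac{\eta_j}{\eta_+}.
\]

Hence, the multiplicative Poisson model becomes the model for independence of
cross-classifications and vice versa since if

\[
\pi_{ij} = \frac{\lambda_{ij}}{\lambda_{++}} = \frac{\lambda_{i+}}{\lambda_{++}}\,\frac{\lambda_{+j}}{\lambda_{++}}
\]

we also have

\[
\lambda_{ij} = \lambda_{++}\,\frac{\lambda_{i+}}{\lambda_{++}}\,\frac{\lambda_{+j}}{\lambda_{++}} = \mu\rho_i\eta_j
\]

which has the multiplicative form desired, with $\mu = \lambda_{0+}\lambda_{+0}/\lambda_{++}$,
$\rho_i = \lambda_{i+}/\lambda_{0+}$, and $\eta_j = \lambda_{+j}/\lambda_{+0}$.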


8.3.4.2. From independence to homogeneity 


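If we now proceed yet one step further and also consider the marginal row totals
$Y_{i+} = n_i$ fixed, we get similarly

\[
\begin{aligned}
P(Y = y \mid Y_{i+} = n_i,\ i=0,\dots,r-1)
&= \frac{P_\lambda(Y = y \text{ and } Y_{i+} = n_i,\ i=0,\dots,r-1)}{P_\lambda(Y_{i+} = n_i,\ i=0,\dots,r-1)} \\
&= \prod_{i=0}^{r-1}\binom{n_i}{y_{i0},\dots,y_{i,s-1}}\prod_{j=0}^{s-1}\pi_{ij}^{y_{ij}}
\end{aligned}
\]

where now $\pi_{ij} = \lambda_{ij}/\lambda_{i+}$. Thus, again, as we condition on the row totals, the
unrestricted multinomial model becomes the model of independent and unrelated
multinomial distributions for the rows of the table. And, if now $\lambda$ has multiplicative
structure, we get

\[
\pi_{ij} = \frac{\lambda_{ij}}{\lambda_{i+}} = \frac{\mu\rho_i\eta_j}{\mu\rho_i\eta_+} = \frac{\eta_j}{\eta_+},
\]

which does not depend on the group $i$, so the multiplicative Poisson model becomes
the model of homogeneity of multinomial distributions.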


8.3.4.3 Exact conditional tests 


If we are focusing our interest on aspects of the distribution that are unrelated to
the actual size of the marginals, for example if we are interested in whether or not
the hypothesis of multiplicativity, independence, or homogeneity is valid, we may
wish to go one step further and consider all marginal totals fixed. Using calculations
similar to those above it may be shown that, under any of these null hypotheses,
the conditional distribution of the entries of the table given the marginal totals is

\[
P(Y = y \mid Y_{i+} = n_{i+},\ i=0,\dots,r-1,\ Y_{+j} = n_{+j},\ j=0,\dots,s-1)
= \frac{\prod_{i=0}^{r-1} n_{i+}!\,\prod_{j=0}^{s-1} n_{+j}!}{n!\,\prod_{i=0}^{r-1}\prod_{j=0}^{s-1} y_{ij}!}
\tag{8.17}
\]

which is known as the multiple hypergeometric distribution.

An important feature of this distribution is that it is free of unknown parameters
as long as either of the hypotheses of multiplicativity, independence of classification
criteria, and homogeneity of multinomials is fulfilled. Although the distribution
appears complicated for large tables, there is a simple Monte Carlo algorithm for
sampling from this distribution due to Patefield (1981). Hence, the method provides
a generic basis for calculating Monte Carlo p-values for essentially any test statistic
of interest and thus an interesting alternative to using asymptotic results, in particular
when some of the entries in a table are small. Although Monte Carlo methods must
often be used to calculate the p-value, we refer to tests using the multiple
hypergeometric distribution as exact conditional tests.

In the special case of a $2 \times 2$ contingency table, the conditional distribution in
(8.17) simplifies to an ordinary hypergeometric distribution as all entries $Y_{ij}$ in the
table are deterministic functions of $Y_{00}$ and the marginal totals. More precisely, we
have

$$P(Y_{00} = y \mid Y_{0+} = n_{0+}, Y_{1+} = n_{1+}, Y_{+0} = n_{+0}, Y_{+1} = n_{+1}) = \frac{\binom{n_{0+}}{y}\binom{n_{1+}}{n_{+0} - y}}{\binom{n}{n_{+0}}}.$$

Table 8.13: Number of patients suffering myocardial infarction compared to smoking
habits and controls. Source: Agresti (2002), based on a study published in the Lancet,
313, 743-747 (1979).

                                   Smoking level
                        No smoking   Moderate smoking   Heavy smoking   Total
Control                     25              25                12          62
Myocardial infarction        0               1                 3           4
Total                       25              26                15          66

Example 8.10. [Smoking and myocardial infarction] We shall illustrate the use of
an exact conditional test applied to data from a study of association between smoking
and myocardial infarction, displayed in Table 8.13. The category ‘Heavy smoking’
corresponds to regular smoking of more than 25 cigarettes a day. The numbers in the
table are small and one may fear that the asymptotic p-values are far from the actual
ones. In fact, the Pearson $\chi^2$ statistic evaluates to $X^2 = 6.96$ which compared to a
$\chi^2$-distribution with 2 degrees of freedom yields an asymptotic p-value of p = 0.03,
whereas the exact p-value may be calculated to be p = 0.052. The likelihood ratio
test evaluates to $G^2 = 6.69$ with an asymptotic p-value of p = 0.035, whereas a
Monte Carlo approximation based on N = 5000 samples yields p = 0.074. These
tests ignore the fact that the variable indicating smoking habits has an ordinal nature,
and may not reveal a tendency of the probability of infarction increasing with the
amount of smoking.
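
The Monte Carlo approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not implement Patefield's algorithm but uses a simpler sampler that randomly permutes the column labels of the individuals, which also produces tables from the multiple hypergeometric distribution with the observed margins. The function names `pearson_x2`, `sample_table` and `monte_carlo_p_value`, the number of simulations and the seed are all choices made for this sketch.

```python
import random

def pearson_x2(table):
    """Pearson X^2 statistic for a two-way table of counts (positive margins assumed)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, y in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            x2 += (y - expected) ** 2 / expected
    return x2

def sample_table(row_tot, col_tot, rng):
    """Draw one table with the given margins by permuting column labels."""
    labels = [j for j, c in enumerate(col_tot) for _ in range(c)]
    rng.shuffle(labels)
    table, start = [], 0
    for r in row_tot:
        counts = [0] * len(col_tot)
        for j in labels[start:start + r]:
            counts[j] += 1
        table.append(counts)
        start += r
    return table

def monte_carlo_p_value(table, statistic, n_sim=4000, seed=1):
    """Fraction of simulated tables whose statistic is at least the observed one."""
    rng = random.Random(seed)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    observed = statistic(table)
    hits = 0
    for _ in range(n_sim):
        # small tolerance guards against floating-point ties with the observed value
        if statistic(sample_table(row_tot, col_tot, rng)) >= observed - 1e-9:
            hits += 1
    return hits / n_sim

smoking = [[25, 25, 12], [0, 1, 3]]   # Table 8.13: controls, infarction
x2_obs = pearson_x2(smoking)          # about 6.96, as in Example 8.10
p_mc = monte_carlo_p_value(smoking, pearson_x2)
```

Any other statistic taking a table as argument can be passed in place of `pearson_x2`; the simulated p-value varies with the seed and the number of simulations.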


Note that if we reject the hypothesis when the conditional p-value is less than $\alpha$,
the test also maintains the overall level $\alpha$. For if we let $K(N_{+1})$ denote the critical
region for the conditional test, we then have the conditional level
$P_0(K(y) \mid N_{+1} = y) \le \alpha$ for every $y$ and thus

$$P_0(K(N_{+1})) = \sum_y P_0(K(N_{+1}) \mid N_{+1} = y)\, P_0(N_{+1} = y) = \sum_y P_0(K(y) \mid N_{+1} = y)\, P_0(N_{+1} = y) \le \alpha \sum_y P_0(N_{+1} = y) = \alpha.$$


8.3.5 Fisher’s exact test 


We shall here focus on the case where r = s = 2 and illustrate the concepts with 
data from Example 8.7 in Hansen (2012). 

Here n = 130 individuals were classified according to whether or not they used 
an old or a new type of computer screen in their work, and whether they had problems 
with reflections from the screen or not. This resulted in the following table: 


Reflection problems 


Screen type No Yes Total 
Old 15 50 65 
New 27 38 65 
Total 42 88 130 


Several different models are possible for these data, as it is not directly specified 
how the data were collected. Were they simply collecting data from all individuals at 
the department involved, did they deliberately choose a sample size of n = 130, or 
were they ensuring that exactly 65 of each screen type were asked about reflection 
problems? 

Since we do not know how the marginals were collected, we may use a sampling
model with all marginals fixed, thus performing an analysis based on the exact
conditional distribution. More precisely, we consider $X_1, \dots, X_{42}$ to be binary variables
corresponding to the 42 individuals in the first column of the table (those reporting
no reflection problems), with $X_i$ indicating whether the screen of individual $i$ is
old ($X_i = 1$) or new ($X_i = 0$). If there is no difference between old and new
screens, we may consider these to be sampled at random and without replacement
from the set of all screens, having 65 of each type. Letting $X = \sum_{i=1}^{42} X_i$,
we have that $X$ follows a hypergeometric distribution

$$P(X = x) = h(x \mid 42, 65, 65) = \frac{\binom{65}{x}\binom{65}{42 - x}}{\binom{130}{42}}.$$

Fisher’s exact test, described in Fisher (1934), uses the size of this probability itself
as test statistic:

$$d(x) = 1 - h(x \mid 42, 65, 65)$$

so values of $x$ are considered extreme if they have very small probability in the
hypergeometric distribution. Thus, in this hypergeometric model, the p-value for the
observed value $x = 15$ becomes

$$p(15) = \sum_{x :\, d(x) \ge d(15)} h(x \mid 42, 65, 65) = 0.03847.$$


The sum in the expression ranges over $x \in \{0, \dots, 15, 27, \dots, 42\}$, as these are the
values with probability no larger than that of the observed value. The procedure is
illustrated in Figure 8.1.
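
As a hedged illustration of the calculation, the following Python sketch evaluates the point probabilities $h(x \mid 42, 65, 65)$ and sums those that do not exceed the probability of the observed value. The function name `fisher_p_value` and its argument names are chosen for this sketch only; the comparison is done on exact integer numerators so that the symmetric value $x = 27$ is treated as a tie with $x = 15$.

```python
from math import comb

def fisher_p_value(x_obs, n_draw, n_old, n_new):
    """Sum of hypergeometric point probabilities no larger than that of x_obs.

    n_draw items are drawn without replacement from n_old + n_new items;
    x counts the drawn items of the first ('old') type.
    """
    lo = max(0, n_draw - n_new)
    hi = min(n_draw, n_old)

    def weight(x):
        # integer numerator of h(x | n_draw, n_old, n_new)
        return comb(n_old, x) * comb(n_new, n_draw - x)

    w_obs = weight(x_obs)
    extreme = [x for x in range(lo, hi + 1) if weight(x) <= w_obs]
    total = comb(n_old + n_new, n_draw)
    p = sum(weight(x) for x in extreme) / total
    return p, extreme

# Screen glare data: 15 old screens among the 42 individuals without problems
p_value, extreme_values = fisher_p_value(15, 42, 65, 65)
```

The list `extreme_values` contains the values of $x$ entering the sum, and `p_value` can be compared with the value 0.03847 reported in the text.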


[Figure: plot titled "Exact test" showing the hypergeometric point probabilities against $x$, for $x$ from 0 to 40.]

Figure 8.1: The hypergeometric distribution used for the exact test in the screen glare
example. The probability of the observation is indicated by a dot. The horizontal line has
height $h(15 \mid 42, 65, 65)$ and the p-value is the sum of the sizes of the point probabilities
that are below or on that line.


8.3.6 Rank statistics for ordinal data 


We shall illustrate the flexibility of exact conditional tests by comparing multinomial
distributions with ordinal categories, i.e. in situations such as in Example 8.5
concerned with the weight of trout or Example 8.10, where there is a special emphasis on
discovering deviations from the hypothesis in the form of a translational shift in the
distribution: is there a tendency of the response distribution shifting to the right or left
when changing category? In the trout example, are the fish generally smaller? In the
smoking example, does the infarction group generally smoke more than the controls?

The standard test statistics $G^2$ and $X^2$ do not take the ordinal structure into
account, as a permutation of the response categories would have no influence on the
values of either. So we would wish to consider alternatives. One such alternative is
the Mann-Whitney or Wilcoxon rank statistic $M^2$ which we shall describe below.

Consider a table of the form in Table 8.4, but with only $r = 2$ categories to be
compared. Imagine first that there is at most one observation in any response category
and the data therefore may be completely sorted as

$$(v_1, g_1), (v_2, g_2), \dots, (v_n, g_n)$$

with $g_i \in \{1, 2\}$ denoting the group and $v_i$ the value of the $i$th observation where
now

$$v_1 < v_2 < \cdots < v_n.$$

We define the rank of an observation $v_i$ as $\rho(v_i) = i$ and add the ranks for the
observations in the first group as

$$H_1 = \sum_{i:\, g_i = 1} \rho(v_i) = \sum_{i:\, g_i = 1} i.$$

If there is no difference between the groups, a simple combinatorial argument shows
that we would have

$$E(\rho(V)) = \frac{n+1}{2}, \quad V(\rho(V)) = \frac{n^2 - 1}{12}, \quad V(\rho(V_1), \rho(V_2)) = \frac{-n-1}{12}$$

and thus if there are $n_{1+}$ observations in the first group, we have

$$E(H_1) = n_{1+}\frac{n+1}{2}, \quad V(H_1) = n_{1+} n_{2+} \frac{n+1}{12}$$

which leads us to construct the test statistic

$$M^2 = \frac{(H_1 - E(H_1))^2}{V(H_1)}$$

and it can be shown that in the situation where the distribution of the response
variable $V$ is continuous so there are no ties among the observations, $M^2$ is
approximately $\chi^2$-distributed with one degree of freedom when both groups are large.

However, when applying this to ordinal data, the ranks are not well-defined, as
many observations have the same value. We then instead allocate the mid-rank to all
data in the same response category. For example, in category 0 we have
$Y_{+0} = Y_{10} + Y_{20}$ observations and these observations all receive the rank $(1 + Y_{+0})/2$.
Similarly, the data in category $j$ are given ranks $\tau_j + (1 + Y_{+j})/2$, where $\tau_0 = 0$ and
$\tau_j = Y_{+0} + \cdots + Y_{+,j-1}$ for $j > 0$. Thus we define the tied rank sum $\tilde H_1$ for group 1 as

$$\tilde H_1 = \sum_{j=0}^{s-1} \left( \tau_j + \frac{1 + Y_{+j}}{2} \right) Y_{1j}.$$

However, now it is less easy to determine the variance of the statistic. If we normalize
it as before, we get

$$\tilde M^2 = \frac{(\tilde H_1 - E(H_1))^2}{V(H_1)}.$$

The asymptotic $\chi^2$ distribution for $\tilde M^2$ may now be less adequate, and we may
therefore evaluate the p-value in the exact conditional distribution with all marginals fixed,
using, say, a Monte Carlo algorithm.
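
The mid-rank construction can be illustrated by a small Python sketch applied to the smoking data of Table 8.13. The helper names `midranks` and `tied_rank_sum` are assumptions of this sketch; the code only computes the tied rank sums and does not address the normalization or the p-value.

```python
def midranks(col_totals):
    """Mid-rank for each response category: tau_j + (1 + Y_{+j}) / 2."""
    ranks, tau = [], 0
    for y in col_totals:
        ranks.append(tau + (1 + y) / 2)
        tau += y  # tau_j is the number of observations in earlier categories
    return ranks

def tied_rank_sum(group_counts, col_totals):
    """Sum of mid-ranks over all observations in one group."""
    return sum(r * y for r, y in zip(midranks(col_totals), group_counts))

control = [25, 25, 12]      # no, moderate, heavy smoking
infarction = [0, 1, 3]
totals = [a + b for a, b in zip(control, infarction)]
n = sum(totals)

ranks = midranks(totals)                      # [13.0, 38.5, 59.0]
h_infarction = tied_rank_sum(infarction, totals)
h_control = tied_rank_sum(control, totals)
```

As a sanity check, the two rank sums add up to $n(n+1)/2$, the sum of all ranks, exactly as in the case without ties.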


Example 8.11. If we consider the trout data in Example 8.5, the value of $\tilde M^2$
becomes 2.02, yielding an asymptotic p-value (ignoring ties) of p = 0.15. If we instead
calculate the Monte Carlo p-value based on N = 5000 simulated tables, we get
p = 0.13; none of these are significant, so we are still not able to document an
effect of the waste water release on the weight of trout in the lake although the
p-value is much smaller than for the standard $G^2$ or $X^2$ test statistics, where we found
p = 0.67.

For the data on smoking and myocardial infarction, we get $M^2 = 4.87$ which
yields an asymptotic p-value of p = 0.027, whereas the Monte Carlo p-value based
on 5000 simulated tables is p = 0.035. Now this should be compared to the results
for $G^2$ with a Monte Carlo p-value of p = 0.074. Thus the ordinal test statistic will
just reject the hypothesis of equality between the infarction and control group on a
5% level, whereas $G^2$ will not.


8.4 Exercises 


Exercise 8.1. 150 Petri dishes prepared with streptomycin were each subjected to
one million E. coli bacteria. If a bacterium mutates to become resistant, it will grow a
colony of surviving bacteria. Otherwise the streptomycin will kill the bacteria. After
growth, the following number of surviving colonies was observed:

Number of colonies per dish    0    1    2    3    4
Number of Petri dishes        98   40    8    3    1

a) Argue why it would be reasonable to expect that the number of colonies in a Petri
dish would be Poisson distributed.

b) Estimate the parameter under the assumption of a Poisson distributed number of
colonies.

c) Investigate whether the observations conform with the Poisson model.

Note that as the Poisson distribution has infinite support, a little modification is
needed to transform the data into a multinomial distribution to answer question c).
This may, for example, be done by merging the categories 3, 4, ... into a category
$\ge 3$ with a count of 4, or even merging to $\ge 2$ with a count of 12. This now becomes
a curved multinomial family, and the analysis may proceed along the same lines as
in Example 8.2.


Exercise 8.2. The logarithmic distribution is a distribution on integers with density

$$f_\theta(x) = \frac{\theta^x}{-x \log(1 - \theta)}, \quad x = 1, 2, \dots$$

for $\theta \in (0, 1)$. Williams (1943) investigated the distribution of the number of
publications per author in a 1913 volume of Review of Applied Entomology with the
following result:

Number of articles per author     1    2    3    4    5    6    7    8    9   10
Number of authors               285   70   32   10    4    3    3    1    2    1

There are a total of 411 authors and 656 articles. Does the logarithmic distribution
describe these data sufficiently well?

Again, as in the previous exercise, some categories need to be merged, for
example introducing a category of $\ge 7$ with a count of 7.

Exercise 8.3. The haptoglobin type for humans is determined by a single diallelic
gene with alleles Hb1 and Hb2. The genotypes are denoted Hb1, Hb1Hb2, and Hb2.

Haptoglobin types were determined for 607 men and 1439 women, selected at
random from the Danish population, resulting in the following table:

              Haptoglobin type
          Hb1   Hb1Hb2   Hb2   Total
Men        83     289    235     607
Women     245     678    516    1439

a) Is the distribution of haptoglobin type the same for men and women? 

b) Is the population in Hardy—Weinberg equilibrium? 

Exercise 8.4. Colour blindness is more common among males than females. This
could be due to colour blindness being inherited via a gene on the X-chromosome, where males
have only one, whereas females have two. Waaler (1927) collected the following data
on colour blindness of 18121 school children in Oslo:

                Boys   Girls   Total
Colour blind     725      40     765
Normal          8324    9032   17356
Total           9049    9072   18121

If this were the case, and $\mu$ denotes the probability that a boy is colour blind, one
would expect that the probability that a girl is colour blind should be equal to $\mu^2$,
corresponding to having two copies of the colour blindness gene. Investigate whether
the data support such a hypothesis.


Exercise 8.5. Calculate a 95% confidence interval for the relative risk of getting 
myocardial infarction when using aspirin as compared to placebo in Example 8.6. 


Exercise 8.6. A teaching assistant at the University of Copenhagen consumed 384 
hard-boiled eggs in the academic year 1968/69 and noted every day how many of the 
eggs broke during cooking so that the egg-white flowed into the boiling water, and 
how many cracked without the egg-white leaving the shell. For 130 of these eggs, 
he used a so-called egg-piercer that pierced a small hole at the bottom of the egg, to 
prevent the egg from breaking. The results of his experiment are summarized below: 


            Broken   Cracked   Whole   Total
Pierced        5        16      109     130
Unpierced     19        36      199     254
Total         24        52      308     384


Is the egg-piercer effective? 


Exercise 8.7. The data below are concerned with 1545 fraternal twins of opposite sex 
and their criminal behaviour. Each twin pair is classified after their criminal status, 
resulting in the following table. 


Female criminal Female not criminal Total 


Male criminal 16 286 302 
Male not criminal 24 1219 1243 
Total 40 1505 1545 


a) Is the criminal status of the male twin independent of that of the female twin?


b) Is there a difference between the sexes with respect to being criminal? 


Exercise 8.8. Consider the shifted multiplicative Poisson model in the case where
$r = s = 2$. Show that the likelihood equations are equivalent to the following
equation in $\rho$:

$$\frac{(Y_{00} + \rho)(Y_{11} + \rho)}{(Y_{10} - \rho)(Y_{01} - \rho)} = \frac{B_{00} B_{11}}{B_{10} B_{01}}$$

with

$$\hat\lambda_{00} = Y_{00} + \rho, \quad \hat\lambda_{01} = Y_{01} - \rho, \quad \hat\lambda_{10} = Y_{10} - \rho, \quad \hat\lambda_{11} = Y_{11} + \rho,$$

expressing that the estimated cross-product ratio is equal to the cross-product ratio
of the background risk $B$. Use this to give an explicit formula for the maximum
likelihood estimator $\hat\lambda$ of $\lambda$ in this special case.

Exercise 8.9. In 1974 there was a suspicion that the number of lung cancer cases
was extraordinarily high in the city of Fredericia (Clemmensen et al., 1974). To
investigate whether this suspicion was justified, cancer incidence in Fredericia was
compared to cancer incidence in Vejle in different age groups. The tables below
display the number of cancer cases in two age groups in the two cities, and the sizes of
the population in the same age groups.


Cancer cases Population 

55-64 70+ Total 55-59 60-64 
Vejle 12 15 27 3398 1158 
Fredericia 22 18 40 3859 1114 
Total 34 33 67 7257 2272 


Now use the results from the previous exercise to answer the following questions: 
a) Is a shifted multiplicative Poisson model adequate for these data? 


b) Does the suspicion concerning a high incidence of lung cancer in Fredericia seem 
justified? 


Exercise 8.10. Compare the p-value for the exact conditional test in the screen glare 
example in Section 8.3.4.3 with asymptotic p-values using the G? and X? statistics. 


Appendix A 


Auxiliary Results 


A.1 Euclidean vector spaces

Definition A.1. A Euclidean vector space $(V, \langle\cdot,\cdot\rangle)$ is a finite-dimensional vector
space over $\mathbb{R}$ with an inner product $\langle\cdot,\cdot\rangle$, satisfying
i) For all $u, v \in V$: $\langle u, v\rangle = \langle v, u\rangle$;
ii) For all $u, v, w \in V$ and all $a, b \in \mathbb{R}$: $\langle u, av + bw\rangle = a\langle u, v\rangle + b\langle u, w\rangle$;
iii) For all $v \in V \setminus \{0\}$: $\langle v, v\rangle > 0$.

Note that i) and ii) combined imply that $\langle\cdot,\cdot\rangle$ is bilinear, i.e. linear in both of its
arguments. From the inner product, we may now define the Euclidean norm
$\|v\| = \sqrt{\langle v, v\rangle}$ and Euclidean distance $d(u, v) = \|u - v\|$. We note the following:


Proposition A.2. Let $V$ be a vector space over $\mathbb{R}$. Then any symmetric bilinear form
$\langle\cdot,\cdot\rangle$ on $V$ is determined by its values $\langle v, v\rangle$, $v \in V$, on the diagonal.

Proof. This follows from the relation

$$\langle u, v\rangle = \frac{1}{4}\bigl(\langle u + v, u + v\rangle - \langle u - v, u - v\rangle\bigr)$$

which is easily established using i) and ii) above.


Observe that $\mathbb{R}^d$ is an example of such a space, with the standard inner product
given by $\langle u, v\rangle = u^\top v$ and any Euclidean space is isomorphic to $\mathbb{R}^d$; an
isomorphism is constructed by choosing an orthonormal basis, i.e. a system $(e_1, \dots, e_d)$ of
elements of $V$ satisfying $\langle e_i, e_j\rangle = 0$ if $i \ne j$ and $\|e_i\| = 1$. Then any element $u \in V$
has a unique representation as

$$u = \langle u, e_1\rangle e_1 + \cdots + \langle u, e_d\rangle e_d = \alpha_1 e_1 + \cdots + \alpha_d e_d$$

where $\alpha = \alpha(u) = (\alpha_1, \dots, \alpha_d)^\top$ is the vector of coordinates of $u$ with respect to
the chosen basis. The correspondence $u \leftrightarrow \alpha$ is an isomorphism so that, in particular,
if $v$ has coordinates $\beta$ with respect to the chosen basis, we have

$$\langle u, v\rangle = \alpha^\top \beta \quad \text{and hence} \quad \|u\|^2 = \alpha_1^2 + \cdots + \alpha_d^2.$$


The Borel $\sigma$-algebra $\mathcal{B}(V)$ on a Euclidean space $V$ is the $\sigma$-algebra generated by open
sets of $V$ and any isomorphism as above is also bimeasurable so we may identify the
Borel sets of $V$ with those of $\mathbb{R}^d$, the particular identification depending on the choice
of basis. Similarly, the standard Lebesgue measure on $V$ is obtained by identifying
any $v \in V$ with its coordinates in an orthonormal basis; it can be shown that this
specification does not depend on the particular orthonormal basis chosen; we refrain
from giving the details.

If $W$ is another vector space we let $\mathcal{L}(V, W)$ denote all linear maps from $V$ to
$W$ and recall that all linear forms $f \in \mathcal{L}(V, \mathbb{R})$ may be uniquely represented via
the inner product as

$$f(v) = \langle u_f, v\rangle.$$


The adjoint $A^*$ of a linear map $A \in \mathcal{L}(V, W)$ is the unique linear map $A^* \in \mathcal{L}(W, V)$
satisfying for all $(v, w) \in V \times W$ that

$$\langle v, A^* w\rangle_V = \langle Av, w\rangle_W$$

where $\langle\cdot,\cdot\rangle_V$ and $\langle\cdot,\cdot\rangle_W$ are the inner products on $V$ and $W$. Clearly, we have
$(A^*)^* = A$.

A map $A \in \mathcal{L}(V, V)$ is self-adjoint if $A^* = A$. Thus for $V = \mathbb{R}^d$ and $W = \mathbb{R}^m$
with standard inner product, the transpose $A^\top \in \mathbb{R}^{d \times m}$ is the matrix for the adjoint
of the map with matrix $A \in \mathbb{R}^{m \times d}$. Then a map is self-adjoint if and only if its matrix
is symmetric.

Let now $L \subseteq V$ be a linear subspace of $V$. The orthogonal projection $\Pi_L$ onto $L$
of $u \in V$ is determined as the unique point $\Pi_L u \in L$ satisfying for all $w \in L$

$$\langle u - \Pi_L u, w\rangle = 0 \tag{A.1}$$

or, in other words, the unique point $\Pi_L u \in L$ satisfying that $u - \Pi_L u$ is orthogonal
to $L$ with respect to the inner product. Here and elsewhere we shall omit the qualifier
‘orthogonal’ as all projections we consider are orthogonal. The projection $\Pi_L u$ is
also the point in $L$ that is closest to $u$ in Euclidean distance:

$$\Pi_L u = \operatorname{argmin}_{w \in L} \|u - w\|^2.$$


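If $v_1, \dots, v_m$ is a set of mutually orthogonal vectors that span $L$, the projection
may be expressed in terms of these as

$$\Pi_L u = \frac{\langle u, v_1\rangle}{\|v_1\|^2} v_1 + \cdots + \frac{\langle u, v_m\rangle}{\|v_m\|^2} v_m \tag{A.2}$$

which in particular simplifies when $v_1, \dots, v_m$ is an orthonormal basis as then
$\|v_i\|^2 = 1$. The relation (A.2) follows since then for any $w = \alpha_1 v_1 + \cdots + \alpha_m v_m \in L$
we have

$$\Bigl\langle u - \sum_{i=1}^m \frac{\langle u, v_i\rangle}{\|v_i\|^2} v_i,\ w \Bigr\rangle = \sum_{j=1}^m \alpha_j \langle u, v_j\rangle - \sum_{i=1}^m \sum_{j=1}^m \alpha_j \frac{\langle u, v_i\rangle}{\|v_i\|^2} \langle v_i, v_j\rangle = 0$$

since $\langle v_i, v_j\rangle = 0$ if $i \ne j$.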
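
Formula (A.2) can be tried out numerically. The following Python sketch is an illustration only: it works in $\mathbb{R}^3$ with the standard inner product, and the helper names `inner` and `project` as well as the example vectors are choices made here.

```python
def inner(u, v):
    """Standard inner product on R^d."""
    return sum(a * b for a, b in zip(u, v))

def project(u, basis):
    """Projection of u onto span(basis) via (A.2).

    Assumes the vectors in `basis` are non-zero and mutually orthogonal.
    """
    proj = [0.0] * len(u)
    for v in basis:
        coef = inner(u, v) / inner(v, v)   # <u, v_i> / ||v_i||^2
        proj = [p + coef * x for p, x in zip(proj, v)]
    return proj

u = [3.0, 4.0, 5.0]
basis = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]   # orthogonal, spans the first two axes
p = project(u, basis)                          # [3.0, 4.0, 0.0]
residual = [a - b for a, b in zip(u, p)]       # orthogonal to both basis vectors
```

The residual $u - \Pi_L u$ has inner product zero with each spanning vector, which is exactly the defining property (A.1).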

Theorem A.3. A projection is linear, i.e. $\Pi \in \mathcal{L}(V, V)$. A linear map $P$ is a
projection if and only if $P$ is idempotent and self-adjoint. Then $P$ is the projection onto its
range $L = \operatorname{rg}(P)$.

Proof. Linearity of $\Pi$ follows directly from (A.1) and linearity of the inner product
since if $y = \alpha u + \beta v$ we have

$$\langle \alpha u + \beta v - (\alpha \Pi u + \beta \Pi v), w\rangle = \alpha \langle u - \Pi u, w\rangle + \beta \langle v - \Pi v, w\rangle = 0.$$

Assume that $P$ is self-adjoint and idempotent. Then for all $u, v \in V$ we have

$$\langle u - Pu, Pv\rangle = \langle u, Pv\rangle - \langle Pu, Pv\rangle = \langle u, Pv\rangle - \langle u, P^2 v\rangle = 0$$

and hence $Pu$ is the projection onto $\operatorname{rg}(P)$.

Conversely, if $Pu = \Pi u$ is the projection of $u$ onto $\operatorname{rg}(P)$, we must have
$P^2 u = Pu$ as $Pu$ is the closest point to $Pu$ in $\operatorname{rg}(P)$. Also, from (A.1)

$$\langle u, Pv\rangle = \langle Pu, Pv\rangle = \langle Pu, v\rangle$$

showing that $P$ is self-adjoint.


If $L \subseteq V$ is a subspace of $V$, its orthogonal complement $L^\perp$ is the subspace of
vectors orthogonal to $L$:

$$L^\perp = \{u \in V \mid \langle u, v\rangle = 0 \text{ for all } v \in L\}.$$

If $\Pi_L$ is the projection onto $L$, $I - \Pi_L$ is the projection onto $L^\perp$ as $I - \Pi_L$ is
idempotent and self-adjoint if and only if $\Pi_L$ is. Then any vector $v \in V$ has a unique
decomposition as $v = u + w$ where $u \in L$ and $w \in L^\perp$

$$v = \Pi_L v + (I - \Pi_L) v = u + w$$

so $V = L \oplus L^\perp$. We note the following relation between the image and null-space
of a linear map:


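Proposition A.4. Let $A \in \mathcal{L}(V, W)$ be a linear map between Euclidean spaces
$(V, \langle\cdot,\cdot\rangle_1)$ and $(W, \langle\cdot,\cdot\rangle_2)$ and $A^* \in \mathcal{L}(W, V)$ its adjoint. Then

$$\operatorname{rg}(A)^\perp = \ker(A^*).$$

Hence, if $A$ is surjective, $A^*$ is injective and vice versa. Also, we have orthogonal
decompositions

$$W = \operatorname{rg}(A) \oplus \ker(A^*), \quad V = \ker(A) \oplus \operatorname{rg}(A^*).$$

Proof. Assume $w \in \operatorname{rg}(A)$, i.e. $w = Av$, and $z \in \ker(A^*)$. Then

$$\langle w, z\rangle_2 = \langle Av, z\rangle_2 = \langle v, A^* z\rangle_1 = \langle v, 0\rangle_1 = 0$$

and hence $w \perp \ker(A^*)$. Conversely, if $z \perp \operatorname{rg}(A)$, the same relation yields
$\langle v, A^* z\rangle_1 = 0$ for all $v \in V$ and hence $z \in \ker(A^*)$. If $A$ is surjective we thus have

$$\{0\} = \operatorname{rg}(A)^\perp = \ker(A^*)$$

whence $A^*$ is injective. The vice versa follows since $A = (A^*)^*$.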

The case where $A = A^*$ is self-adjoint yields the following corollary:

Corollary A.5. Let $(V, \langle\cdot,\cdot\rangle)$ be a Euclidean space and $\Sigma \in \mathcal{L}(V, V)$ self-adjoint.
Then $V = \operatorname{rg}(\Sigma) \oplus \ker(\Sigma)$.

Further we have the following useful lemma:

Lemma A.6. If $\Sigma = AA^*$ we have $\operatorname{rg}(A) = \operatorname{rg}(\Sigma)$.

Proof. The inclusion $\operatorname{rg}(\Sigma) \subseteq \operatorname{rg}(A)$ follows since

$$u = \Sigma v = AA^* v = A(A^* v).$$

For the reverse inclusion, assume $u \in \operatorname{rg}(A)$ and thus $u = Av$ for some $v \in V$.
By Proposition A.4 we may write $v = v_1 + v_2$ where $v_1 = A^* w \in \operatorname{rg}(A^*)$ and
$v_2 \in \ker(A)$ implying

$$u = Av = AA^* w + Av_2 = AA^* w = \Sigma w$$

and hence $u \in \operatorname{rg}(\Sigma)$, as required.


A.2 Convergence of random variables 


This section collects some standard results from probability theory; consult, for 
example Jacod and Protter (2004) for details. 


A.2.1 Convergence in probability 


Recall the following from Jacod and Protter (2004, p. 143 ff.): 


Definition A.7. [Convergence in probability] A sequence $X_1, \dots, X_n, \dots$ of random
variables with values in $\mathbb{R}^k$ is said to converge in probability to a random variable
$X$ if for all $\delta > 0$

$$\lim_{n \to \infty} P\{\|X_n - X\| > \delta\} = 0$$

where $\|\cdot\|$ is any of the equivalent norms on $\mathbb{R}^k$. Symbolically we write $X_n \xrightarrow{P} X$
or $\operatorname{plim}_{n \to \infty} X_n = X$.

We note that

Proposition A.8. If $f : \mathbb{R}^k \to \mathbb{R}^m$ is continuous and $X_n \xrightarrow{P} X$, then it holds that
$f(X_n) \xrightarrow{P} f(X)$.

The (weak) law of large numbers says that

Theorem A.9 (Law of Large Numbers). If $X_1, \dots, X_n, \dots$ are independent and
identically distributed with values in $\mathbb{R}^k$ and $E(\|X_1\|) < \infty$, then the average
converges in probability to the mean $\xi = E(X_1)$:

$$\bar X_n = \frac{X_1 + \cdots + X_n}{n} \xrightarrow{P} \xi \quad \text{for } n \to \infty.$$

See Remark 20.1 in Jacod and Protter (2004) for this. There is a similar strong
law, saying that under the same conditions, the set of $\omega \in \Omega$ where the average
$\bar X_n(\omega)$ does not converge is a null set. However, it is the weak version above that
plays a role in Statistics.

A.2.2 Convergence in distribution

Let $X$ be a random variable with values in $\mathbb{R}^k$ and $F$ its distribution function, i.e.

$$F(x) = F(x_1, \dots, x_k) = P\{X_1 \le x_1, \dots, X_k \le x_k\}.$$

We then define

Definition A.10. [Convergence in distribution] A sequence $X_1, \dots, X_n, \dots$ of random
variables with values in $\mathbb{R}^k$ and distribution functions $F_n$ is said to converge in
distribution to $X$ if it holds for all continuity points $x$ of $F$ that

$$\lim_{n \to \infty} F_n(x) = F(x).$$

Symbolically we write $X_n \xrightarrow{D} X$ or $\operatorname{dlim}_{n \to \infty} X_n = X$, or even $X_n \xrightarrow{D} F$, and
we also say that $(X_n)_{n \in \mathbb{N}}$ converges in distribution to $F$. Convergence in distribution
is preserved for continuous transformations, a result that is known as the continuous
mapping theorem:

Theorem A.11. If $f : \mathbb{R}^k \to \mathbb{R}^m$ is continuous and $X_n \xrightarrow{D} X$, then it holds that
$f(X_n) \xrightarrow{D} f(X)$.

Note that if $X = \xi$ is a constant random variable, convergence in distribution and
in probability coincide:

$$X_n \xrightarrow{D} \xi \iff X_n \xrightarrow{P} \xi.$$

We shall often use the following result, which is an easy consequence of the
continuous mapping theorem above and the fact that if $X_n \xrightarrow{D} X$ and $Y_n \xrightarrow{P} \eta$ then
$(X_n, Y_n) \xrightarrow{D} (X, \eta)$; see for example van der Vaart (2012, p. 11) for details.

Corollary A.12. Assume $(X_1, Y_1), \dots, (X_n, Y_n), \dots$ is a sequence of random
variables with values in $\mathbb{R}^k \times \mathbb{R}^m$ and $f : \mathbb{R}^k \times \mathbb{R}^m \to \mathbb{R}^l$ is continuous. If $X_n \xrightarrow{D} X$
and $Y_n \xrightarrow{P} \eta$ then $f(X_n, Y_n) \xrightarrow{D} f(X, \eta)$.


Definition A.13. Consider two sequences of random variables $X_1, \dots, X_n, \dots$ and
$Y_1, \dots, Y_n, \dots$. We say that the sequences are asymptotically equivalent and write

$$X_n \overset{as}{=} Y_n$$

if $Y_n - X_n \xrightarrow{P} 0$ for $n \to \infty$.

Corollary A.12 is also known as Slutsky’s lemma. It follows that asymptotically
equivalent sequences have the same limiting distribution:

Proposition A.14. If $X_n \overset{as}{=} Y_n$ then $X_n \xrightarrow{D} X \iff Y_n \xrightarrow{D} X$; if $f : \mathbb{R}^k \to \mathbb{R}^m$ is
continuous and $X_n \xrightarrow{D} X$ then $X_n \overset{as}{=} Y_n \implies f(X_n) \overset{as}{=} f(Y_n)$.

Proof. Let $Z_n = Y_n - X_n$. Then if $X_n \xrightarrow{D} X$, Corollary A.12 (Slutsky’s lemma)
yields

$$Y_n = X_n + Z_n \xrightarrow{D} X + 0 = X$$

and the converse follows by symmetry. Further, if $f$ is continuous, we have

$$f(Y_n) - f(X_n) = f(X_n + Z_n) - f(X_n)$$

so if $X_n \xrightarrow{D} X$, Slutsky’s lemma yields that

$$f(Y_n) - f(X_n) \xrightarrow{D} f(X + 0) - f(X) = 0$$

and hence $f(X_n) \overset{as}{=} f(Y_n)$.


An important instance of convergence in distribution is associated with the nor- 
mal distribution. More precisely we define: 


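Definition A.15. [Asymptotic normality] A sequence $X_1, \dots, X_n, \dots$ of random
variables with values in $\mathbb{R}^k$ is said to be asymptotically normally distributed with
asymptotic mean $\xi$ and asymptotic variance $\Sigma/n$ if

$$\sqrt{n}(X_n - \xi) \xrightarrow{D} Z$$

where $Z \sim N_k(0, \Sigma)$. We then write $X_n \overset{as}{\sim} N_k(\xi, \Sigma/n)$.

Note that, formally, this means that for all $x \in \mathbb{R}^k$

$$\lim_{n \to \infty} P\{\sqrt{n}(X_n - \xi) \le x\} = \Phi_\Sigma(x),$$

where the inequality sign is interpreted coordinatewise. Here $\Phi_\Sigma$ is the distribution
function of the multivariate normal distribution

$$\Phi_\Sigma(x) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_k} \frac{1}{(2\pi)^{k/2} (\det \Sigma)^{1/2}}\, e^{-y^\top \Sigma^{-1} y / 2}\, dy.$$

We now recall the following fundamental result: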
We now recall the following fundamental result: 


Theorem A.16 (Central Limit Theorem). If $X_1, \dots, X_n, \dots$ are independent and
identically distributed with values in $\mathbb{R}^k$ and $E\|X_1\|^2 < \infty$, then their average is
asymptotically normally distributed as

$$\bar X_n = \frac{X_1 + \cdots + X_n}{n} \overset{as}{\sim} N_k\Bigl(\xi, \frac{1}{n}\Sigma\Bigr)$$

where $\xi = E(X_1)$ and $\Sigma = V(X_1)$.

In other words, the displayed expression in the theorem can be written

$$\sqrt{n}(\bar X_n - \xi) \xrightarrow{D} Z, \quad Z \sim N_k(0, \Sigma).$$


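We note in passing that a sequence of asymptotically normal random variables always
converges in probability to its asymptotic mean:

Corollary A.17. Let $X_1, \dots, X_n, \dots$ be a sequence of random variables with values
in $\mathbb{R}^k$ such that $X_n \overset{as}{\sim} N_k(\xi, \Sigma/n)$. Then $\operatorname{plim}_{n \to \infty} X_n = \xi$.

Proof. Using the maximum norm on $\mathbb{R}^k$, we have

$$\lim_{n \to \infty} P\{\|X_n - \xi\| > \delta\} \le \lim_{n \to \infty} \sum_{i=1}^k P\{|X_{ni} - \xi_i| > \delta\}.$$

But for any fixed $K > 0$ we have that

$$\lim_{n \to \infty} \sum_{i=1}^k P\{\sqrt{n}|X_{ni} - \xi_i| > \delta\sqrt{n}\} \le \lim_{n \to \infty} \sum_{i=1}^k P\{\sqrt{n}|X_{ni} - \xi_i| > K\} = \sum_{i=1}^k 2\bigl(1 - \Phi(K/\sqrt{\sigma_{ii}})\bigr),$$

where $\sigma_{ii}$ denotes the $i$th diagonal element of $\Sigma$.
Letting now $K \to \infty$ makes the last expression tend to zero and the desired
conclusion follows.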


We shall also be interested in convergence in distribution and asymptotic
equivalence of quadratic forms. So let $Z_n, n \in \mathbb{N}$ be a sequence of random variables with
values in $\mathbb{R}^d$ and $K_n, n \in \mathbb{N}$ a sequence of random $d \times d$ matrices.

Lemma A.18. Assume that the sequence $Z_n, n \in \mathbb{N}$ converges in distribution to $Z$
and the sequence $K_n, n \in \mathbb{N}$ converges in probability to $K$. Then it holds that

$$Z_n^\top K_n Z_n \overset{as}{=} Z_n^\top K Z_n \xrightarrow{D} Z^\top K Z.$$

Proof. Write

$$Z_n^\top K_n Z_n - Z_n^\top K Z_n = Z_n^\top (K_n - K) Z_n.$$

Since $K_n - K \xrightarrow{P} 0$, Slutsky’s lemma yields that $Z_n^\top K_n Z_n \overset{as}{=} Z_n^\top K Z_n$. The
convergence in distribution follows from the continuous mapping theorem applied to the
continuous map $z \mapsto z^\top K z$.


A.2.3 The delta method 


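From the definition of asymptotic normality it follows that if $A$ is an $m \times k$ matrix
representing a linear map from $\mathbb{R}^k$ to $\mathbb{R}^m$ and $X_n \overset{as}{\sim} N_k(\xi, \Sigma/n)$, then

$$Y_n = AX_n \overset{as}{\sim} N_m(A\xi, A\Sigma A^\top / n).$$

This follows from Theorem A.11 since $x \mapsto Ax$ is continuous and

$$\sqrt{n}(Y_n - A\xi) = A\sqrt{n}(X_n - \xi) \xrightarrow{D} AZ$$

and if $Z \sim N_k(0, \Sigma)$ then $AZ \sim N_m(0, A\Sigma A^\top)$.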


The following important result says essentially that differentiable functions be- 
have like linear functions in this respect. 
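
Theorem A.19 (The delta method). Let $X_1, \dots, X_n, \dots$ be a sequence of random
variables with values in $\mathbb{R}^k$ such that $X_n \overset{as}{\sim} N_k(\xi, \Sigma/n)$. Let $U$ be an open and
convex neighbourhood of $\xi$ and assume that $f : U \to \mathbb{R}^m$ is differentiable. It then
holds that

$$\sqrt{n}(f(X_n) - f(\xi)) \overset{as}{=} \sqrt{n}\, Df(\xi)(X_n - \xi) \xrightarrow{D} Df(\xi) Z$$

where $Z \sim N_k(0, \Sigma)$ and $Df(\xi)$ is the Jacobian matrix of partial derivatives

$$Df(\xi) = \Bigl(\frac{\partial f_i}{\partial x_j}(\xi)\Bigr)_{i = 1, \dots, m;\ j = 1, \dots, k}.$$

We thus have

$$Y_n = f(X_n) \overset{as}{\sim} N_m\bigl(f(\xi), Df(\xi)\, \Sigma\, Df(\xi)^\top / n\bigr).$$

Proof. Differentiability of $f$ implies

$$f(X_n) - f(\xi) = Df(\xi)(X_n - \xi) + \epsilon(X_n - \xi)\|X_n - \xi\|$$

where $\epsilon(x) \to 0$ for $x \to 0$. Now multiply both sides with $\sqrt{n}$ to get

$$\sqrt{n}(f(X_n) - f(\xi)) = \sqrt{n}\, Df(\xi)(X_n - \xi) + \epsilon(X_n - \xi)\|\sqrt{n}(X_n - \xi)\|.$$

But $X_n \xrightarrow{P} \xi$ by Corollary A.17 so $\epsilon(X_n - \xi) \xrightarrow{P} 0$ and thus

$$\sqrt{n}(f(X_n) - f(\xi)) \overset{as}{=} \sqrt{n}\, Df(\xi)(X_n - \xi).$$

Since $\sqrt{n}(X_n - \xi) \xrightarrow{D} N_k(0, \Sigma)$, Corollary A.12 implies

$$\sqrt{n}(f(X_n) - f(\xi)) \xrightarrow{D} N_m\bigl(0, Df(\xi)\, \Sigma\, Df(\xi)^\top\bigr)$$

as also required.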


Note that we could without loss of generality assume that $U = \mathbb{R}^k$ as $f$ always
can be extended to $\mathbb{R}^k$ in a measurable way. Note also that if $f(x) = Ax$, then
$Df(\xi) = A$ so this indeed extends the linear case.
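
As a small worked illustration, which is not part of the original text, consider the one-dimensional case $k = m = 1$ with $f(x) = x^2$ and assume $X_n \overset{as}{\sim} N(\xi, \sigma^2/n)$. The delta method then gives:

```latex
% Illustrative assumption: k = m = 1, f(x) = x^2, X_n asymptotically N(\xi, \sigma^2/n)
Df(\xi) = 2\xi, \qquad
\sqrt{n}\,\bigl(X_n^2 - \xi^2\bigr) \xrightarrow{D} N\bigl(0,\ 4\xi^2\sigma^2\bigr), \qquad
X_n^2 \overset{as}{\sim} N\bigl(\xi^2,\ 4\xi^2\sigma^2/n\bigr).
```

For $\xi = 0$ the limiting variance is zero, so the statement then only says that $\sqrt{n} X_n^2$ converges in probability to zero.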


A.3 Results from real analysis

A.3.1 Inverse and implicit functions

We shall need the following fundamental theorems about inverse functions and
functions which are implicitly defined through equations, see, for example (Rudin, 1976,
p. 221-227) or many other books on real analysis. We recall that a function $f$ is
smooth if $f$ is infinitely often differentiable.

Theorem A.20 (Inverse function theorem). Let $f : U \to \mathbb{R}^k$ be a smooth function
where $U \subseteq \mathbb{R}^k$ is open. Suppose that $f$ is injective and $\det(Df(x)) \ne 0$ for all
$x \in U$. Then $f(U)$ is open and $f$ is a diffeomorphism of $U$ onto $f(U)$. Further, if
$h = f^{-1}$ then $h$ is smooth and

$$Dh(f(x)) = (Df(x))^{-1} \tag{A.3}$$

or, in words, the derivative of the inverse function is the inverse of the derivative.

The formula in (A.3) is obtained by composite differentiation (Rudin, 1976,
Theorem 9.25) in the equation $h(f(x)) = x$, which yields

$$Dh(f(x))\, Df(x) = I_k$$

and thus implies (A.3).
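
A simple one-dimensional instance, added here only as an illustration, is $f(x) = e^x$ on $U = \mathbb{R}$, which is injective with non-vanishing derivative and has inverse $h = \log$ on $f(U) = (0, \infty)$:

```latex
% Illustrative example: f(x) = e^x on U = R, h(y) = \log y on f(U) = (0, \infty)
Df(x) = e^x \ne 0, \qquad
Dh(y) = \frac{1}{y}, \qquad
Dh(f(x)) = \frac{1}{e^x} = (Df(x))^{-1}.
```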


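Theorem A.21 (Implicit function theorem). Let $f : \Omega \to \mathbb{R}^m$ be a smooth function
where $\Omega \subseteq \mathbb{R}^{k+m}$ is open and let $z \in \mathbb{R}^m$ be fixed. Let further

$$\Omega^z = \{(x, y) \in \Omega : f(x, y) = z\}$$

and let $\omega^0 = (x^0, y^0) \in \Omega^z$. Assume that the determinant $\det A(\omega^0)$ of the last $m$
columns of the Jacobian matrix

$$A(\omega) = \Bigl(\frac{\partial f_i}{\partial y_j}(\omega)\Bigr)_{i, j = 1, \dots, m}$$

is non-zero. Then there exist open intervals $I \subseteq \mathbb{R}^k$ and $J \subseteq \mathbb{R}^m$ with

$$\omega^0 \in W = I \times J \subseteq \Omega$$

and a smooth map $g : I \to J$ such that

$$\Omega^z \cap W = \{(x, g(x)) : x \in I\}.$$

Moreover, the partial derivatives of $g$ are determined as

$$\frac{\partial g_i}{\partial x_j}(x) = -\sum_{l=1}^m \bigl(A^{-1}(x, g(x))\bigr)_{il}\, \frac{\partial f_l}{\partial x_j}(x, g(x)). \tag{A.4}$$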


To obtain the formula for the partial derivatives of $g$ we have used a calculation
known as implicit differentiation. If we have shown that in some neighbourhood, $y$
is a smooth function $y = g(x)$ of $x$ we simply use composite differentiation in the
equation $f(x, g(x)) = z$. Differentiating on both sides of the equation with respect
to $x$, we get

$$D_1 f(x, g(x))\, I_k + D_2 f(x, g(x))\, Dg(x) = 0,$$

where $D_1 f$ is the matrix of partial derivatives of $f(x, y)$ with respect to $x$ and $D_2 f$
with respect to $y$. Now $D_2 f$ is what has been termed $A$ in the theorem above, so if
we solve the equation with respect to $Dg(x)$ we get

$$Dg = -A^{-1} D_1 f$$

which exactly is equation (A.4), written in compact form.

We note that these theorems also have versions where $f$ is only assumed to be
$r$ times continuously differentiable for $r \ge 1$. The conclusions then warrant the
implicit or inverse functions to be continuously differentiable of the same order.
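
To see implicit differentiation at work in a familiar case (an illustration added here, with $k = m = 1$), take $f(x, y) = x^2 + y^2$ and $z = 1$, and consider a point on the unit circle with $y > 0$:

```latex
% Illustrative example: f(x, y) = x^2 + y^2, z = 1, upper half of the unit circle
D_1 f(x, y) = 2x, \qquad A = D_2 f(x, y) = 2y \ne 0, \qquad
g(x) = \sqrt{1 - x^2}, \qquad
Dg(x) = -A^{-1} D_1 f = -\frac{2x}{2 g(x)} = -\frac{x}{\sqrt{1 - x^2}}.
```

The result agrees with differentiating $g(x) = \sqrt{1 - x^2}$ directly.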
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A.3.2 Taylor approximation 


Taylor approximation is an important and often used technique in mathematical stat- 
istics. It exists in several versions, with the remainder term expressed by an integral, 
with the remainder term represented by an intermediate value (Lagrange’s version), 
or just with an estimate of the size of the remainder term (Peano’s version). We only 
use Taylor approximation of the second-order version so these are the only versions 
given here. 


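Theorem A.22 (Taylor’s formula with remainder term in integral form). Let $\Omega \subseteq \mathbb{R}^k$
be an open and convex set and let $f : \Omega \to \mathbb{R}$ be a real-valued and smooth function.
Then for any $\omega_0, \omega \in \Omega$ it holds that

\[
f(\omega) = f(\omega_0) + Df(\omega_0)(\omega - \omega_0)
+ (\omega - \omega_0)^\top \left( \int_0^1 (1 - t)\, D^2 f(\omega_0 + t(\omega - \omega_0))\, dt \right) (\omega - \omega_0). \tag{A.5}
\]

Here $Df$ is the derivative and $D^2 f$ the Hessian of $f$.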


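Proof. We have for any real-valued and smooth function $g$

\[ g(u) = g(0) + \int_0^u g'(t)\, dt. \]

For $u$ fixed we integrate by parts to obtain

\[
\begin{aligned}
g(u) &= g(0) + \big[(t - u)\, g'(t)\big]_0^u - \int_0^u (t - u)\, g''(t)\, dt \\
&= g(0) + u\, g'(0) + \int_0^u (u - t)\, g''(t)\, dt.
\end{aligned} \tag{A.6}
\]

Now use this for $g(u) = f(\omega_0 + u(\omega - \omega_0))$. Then we have

\[ g(0) = f(\omega_0), \quad g'(0) = Df(\omega_0)(\omega - \omega_0) \]

and

\[ g''(u) = (\omega - \omega_0)^\top D^2 f(\omega_0 + u(\omega - \omega_0))\, (\omega - \omega_0). \]

Since $g(1) = f(\omega)$ we get (A.5) when letting $u = 1$ in (A.6).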
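
The identity (A.5) can be checked numerically. The sketch below is only an illustration under stated assumptions: the test function $f(\omega) = \exp(\omega_1 + 2\omega_2)$ on $\mathbb{R}^2$ is a hypothetical choice, and the integral is approximated by a composite Simpson rule, so the two sides agree only up to quadrature error.

```python
import math


def f(w):
    # Assumed smooth test function on R^2 (not from the text).
    return math.exp(w[0] + 2.0 * w[1])


def grad(w):
    v = f(w)
    return [v, 2.0 * v]


def hess(w):
    v = f(w)
    return [[v, 2.0 * v], [2.0 * v, 4.0 * v]]


def quad_form(h, d):
    # d' h d for a 2 x 2 matrix h and a vector d.
    return sum(d[i] * h[i][j] * d[j] for i in range(2) for j in range(2))


def taylor_integral_form(w0, w, n=200):
    # Right-hand side of (A.5); n must be even for Simpson's rule.
    d = [w[0] - w0[0], w[1] - w0[1]]

    def integrand(t):
        point = [w0[0] + t * d[0], w0[1] + t * d[1]]
        return (1.0 - t) * quad_form(hess(point), d)

    step = 1.0 / n
    total = integrand(0.0) + integrand(1.0)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * integrand(i * step)
    remainder = total * step / 3.0
    linear = sum(gi * di for gi, di in zip(grad(w0), d))
    return f(w0) + linear + remainder


w0, w = [0.1, -0.2], [0.5, 0.3]
print(f(w), taylor_integral_form(w0, w))
```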


For completeness we also state the more standard version with Lagrange’s re- 
mainder term, again only in the second-order version. 


Theorem A.23 (Taylor’s formula with Lagrange’s remainder term). Let $\Omega \subseteq \mathbb{R}^k$ be an
open and convex set and $f : \Omega \to \mathbb{R}$ a real-valued and smooth function. Then for any
$\omega_0, \omega \in \Omega$, there are $\omega^*$ and $\omega^{**}$ on the line segment $\{\omega_0 + t(\omega - \omega_0) : t \in [0, 1]\}$
with

\[ f(\omega) = f(\omega_0) + Df(\omega^*)(\omega - \omega_0), \]

and

\[ f(\omega) = f(\omega_0) + Df(\omega_0)(\omega - \omega_0) + \tfrac{1}{2} (\omega - \omega_0)^\top D^2 f(\omega^{**})\, (\omega - \omega_0). \]

Here $Df$ is the derivative and $D^2 f$ the Hessian of $f$.


Proof. We define the function $g(t) = f(\omega_0 + t(\omega - \omega_0))$ and use the one-dimensional
result (Rudin, 1976, Theorem 5.15) for this function, yielding that

\[ f(\omega) = g(1) = g(0) + g'(t^*) = g(0) + g'(0) + g''(t^{**})/2 \tag{A.7} \]

for some $t^*, t^{**} \in (0, 1)$. By composite differentiation, we get

\[ g'(t) = Df(\omega_0 + t(\omega - \omega_0))(\omega - \omega_0) \]

and

\[ g''(t) = (\omega - \omega_0)^\top D^2 f(\omega_0 + t(\omega - \omega_0))\, (\omega - \omega_0). \]

Letting $\omega^* = \omega_0 + t^*(\omega - \omega_0)$ and $\omega^{**} = \omega_0 + t^{**}(\omega - \omega_0)$ yields the result.


Finally we wish to use a form that works even for functions with values in $\mathbb{R}^m$
rather than just real-valued functions.

Theorem A.24 (Taylor’s formula with Peano’s remainder term). Let $\Omega \subseteq \mathbb{R}^k$ be
an open and convex set and let $f : \Omega \to \mathbb{R}^m$ be a smooth function. Then for any
$\omega_0, \omega \in \Omega$ it holds that

\[
f(\omega) = f(\omega_0) + Df(\omega_0)(\omega - \omega_0) + \tfrac{1}{2} (\omega - \omega_0)^\top D^2 f(\omega_0)\, (\omega - \omega_0)
+ \varepsilon(\omega - \omega_0)\, \|\omega - \omega_0\|^2
\]

where

\[ \lim_{x \to 0} \varepsilon(x) = 0. \]


Proof. See, for example, Exercise 9.30 in Rudin (1976) and use the Lagrange version,
exploiting that the difference $D^2 f(\omega^{**}) - D^2 f(\omega_0) \to 0$ when $\omega \to \omega_0$. This argument
provides such an epsilon function $\varepsilon_i$ for each of the functions $f_i$, $i = 1, \ldots, m$.
Letting $\varepsilon(x) = (\varepsilon_1(x), \ldots, \varepsilon_m(x))^\top$ yields the result.


A.4 The information inequality

We shall establish a useful inequality. First recall that for $x > 0$

\[ \log x \le x - 1 \tag{A.8} \]

and (A.8) is strict unless $x = 1$. This follows from Taylor’s formula with the Lagrange
remainder term, expanding around $x = 1$:

\[ \log x = x - 1 - \frac{(x - 1)^2}{2 y^2} \]

for some $y$ between $1$ and $x$. We then have

Lemma A.25 (Information inequality). Let $a, b \in \mathbb{R}^k$ be vectors of positive numbers
with $\sum_i a_i = \sum_i b_i$. Then

\[ \sum_i b_i \log a_i \le \sum_i b_i \log b_i \tag{A.9} \]

and the inequality is strict unless $a_i = b_i$ for all $i$.


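Proof. From (A.8) we get

\[
\sum_i b_i \log a_i - \sum_i b_i \log b_i = \sum_i b_i \log \frac{a_i}{b_i}
\le \sum_i b_i \left( \frac{a_i}{b_i} - 1 \right) = \sum_i a_i - \sum_i b_i = 0
\]

which gives the result since the inequality is strict unless $a_i = b_i$ for all $i$.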
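
The information inequality is easy to probe numerically. The following Python sketch draws random positive vectors, rescales one of them so that the sums agree, and records the gap $\sum_i b_i \log b_i - \sum_i b_i \log a_i$, which should never be negative. The sampling ranges, the seed, and the name `info_gap` are arbitrary assumptions made for the illustration.

```python
import math
import random


def info_gap(a, b):
    # sum_i b_i log b_i - sum_i b_i log a_i; non-negative when sum(a) == sum(b).
    return (sum(bi * math.log(bi) for bi in b)
            - sum(bi * math.log(ai) for ai, bi in zip(a, b)))


random.seed(0)
gaps = []
for _ in range(1000):
    b = [random.uniform(0.1, 5.0) for _ in range(4)]
    a = [random.uniform(0.1, 5.0) for _ in range(4)]
    scale = sum(b) / sum(a)
    a = [scale * ai for ai in a]  # rescale so that the two sums agree
    gaps.append(info_gap(a, b))

print(min(gaps))
```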


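A.5 Trace of a matrix

The trace $\operatorname{tr}(A)$ of a square matrix $A = \{a_{ij}\}$ is the sum of its diagonal elements
$\operatorname{tr}(A) = \sum_i a_{ii}$ and has, for example, the following properties:

1. $\operatorname{tr}(\gamma A + \mu B) = \gamma \operatorname{tr}(A) + \mu \operatorname{tr}(B)$ for $\gamma, \mu \in \mathbb{R}$;

2. $\operatorname{tr}(A) = \operatorname{tr}(A^\top)$;

3. $\operatorname{tr}(AB) = \operatorname{tr}(BA)$ for any $r \times s$-matrix $A$ and any $s \times r$-matrix $B$;

4. $E(\operatorname{tr}(X)) = \operatorname{tr}(E(X))$ for any random square matrix $X$ with entries having finite
expectation.


Here expectations are taken coordinatewise in the last expression. These facts follow
by direct calculation from the definition of the trace. For example

\[ \operatorname{tr}(AB) = \sum_i (AB)_{ii} = \sum_i \sum_j a_{ij} b_{ji} = \sum_j \sum_i b_{ji} a_{ij} = \sum_j (BA)_{jj} = \operatorname{tr}(BA). \]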
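
The first three properties can be verified on small integer matrices. The sketch below uses plain Python lists; the helper names `matmul`, `transpose` and `trace` and the example matrices are assumptions for the illustration, not notation from the text.

```python
def matmul(a, b):
    # Product of two matrices stored as lists of rows.
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def trace(a):
    # Sum of the diagonal elements of a square matrix.
    return sum(a[i][i] for i in range(len(a)))


A = [[1, 2, 3], [4, 5, 6]]        # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]   # 3 x 2
C = [[2, 0], [1, 3]]
D = [[1, 4], [5, 6]]

# Property 3: AB is 2 x 2 and BA is 3 x 3, yet the traces agree.
print(trace(matmul(A, B)), trace(matmul(B, A)))
```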


Appendix B 


Technical Proofs 


In this chapter, we collect some proofs that may appear more technical than inform- 
ative. 


B.1 Analytic properties of exponential families


We need a couple of important technical lemmas associated with the normalizing 
constant in regular exponential families. 


B.1.1 Integrability of derivatives


Lemma B.1. Consider a regular exponential family with canonical parameter space
$\Theta$, canonical statistic $t$, and representation space $\mathcal{X}$. Let further $\theta$ and $h$ satisfy that
$\theta \pm h \in \Theta$. It then holds for any $n \in \mathbb{N}$ that

\[ E_\theta\left( |h^\top t(X)|^n \right) < \infty. \tag{B.1} \]

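Proof. Consider the function

\[ g(n, x) = \frac{1}{n!}\, |h^\top t(x)|^n\, e^{\theta^\top t(x)}. \]

We first show that this function is integrable with respect to $\mu \otimes m$, where $\mu$ is the
base measure and $m$ is counting measure on $\mathbb{N}_0$. Indeed, since $e^{|u|} \le e^{u} + e^{-u}$ for all
$u \in \mathbb{R}$ we have

\[
\begin{aligned}
\int \sum_{n=0}^{\infty} |g(n, x)|\, \mu(dx)
&= \int \sum_{n=0}^{\infty} \frac{1}{n!}\, |h^\top t(x)|^n\, e^{\theta^\top t(x)}\, \mu(dx) \\
&= \int e^{|h^\top t(x)|}\, e^{\theta^\top t(x)}\, \mu(dx) \\
&\le \int \left( e^{(\theta + h)^\top t(x)} + e^{(\theta - h)^\top t(x)} \right) \mu(dx) \\
&= c(\theta + h) + c(\theta - h) < \infty
\end{aligned}
\]

since we have assumed that $\theta \pm h \in \Theta$. Fubini’s theorem now yields that $g(n, \cdot)$ is
integrable for all $n$ which yields (B.1).

In addition we have the following: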

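Lemma B.2. Consider a regular exponential family with canonical parameter space
$\Theta$, canonical statistic $t$, representation space $\mathcal{X}$, and base measure $\mu$. Then for every
$\theta \in \Theta$ and every $m_1, \ldots, m_k \in \mathbb{N}_0$ there is a neighbourhood $U_\theta$ and a $\mu$-integrable
function $h_\theta$ such that for $\eta \in U_\theta$ it holds that

\[
\left| \frac{\partial^{m_1 + \cdots + m_k}}{\partial \eta_1^{m_1} \cdots \partial \eta_k^{m_k}}\, e^{\eta^\top t(x)} \right|
= \left| \prod_{i=1}^{k} t_i(x)^{m_i}\, e^{\eta^\top t(x)} \right| \le h_\theta(x).
\]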


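Proof. Let $a_j$, $j = 1, \ldots, 2^k$ denote the $2^k$ vectors with coordinates $a_{ji} = \pm 1$. Let
further $C_\varepsilon$ denote the cube with $\varepsilon a_j$, $j = 1, \ldots, 2^k$ as corners and let

\[ U_\theta = \{\eta \mid \eta - \theta \in C_\varepsilon\}. \]

We have for $\eta = \theta + h \in U_\theta$

\[
\begin{aligned}
\left| \prod_{i=1}^{k} t_i(x)^{m_i}\, e^{\eta^\top t(x)} \right|
&= e^{\theta^\top t(x)} \prod_{i=1}^{k} |t_i(x)|^{m_i}\, e^{h^\top t(x)} \\
&\le e^{\theta^\top t(x)} \prod_{i=1}^{k} |t_i(x)|^{m_i} \sum_{j=1}^{2^k} e^{\varepsilon a_j^\top t(x)}.
\end{aligned}
\]

If $\varepsilon$ is chosen sufficiently small, it holds that $\theta + \varepsilon a_j \in \Theta$ for $j = 1, \ldots, 2^k$ and hence
we may let

\[ h_\theta(x) = \sum_{j=1}^{2^k} \prod_{i=1}^{k} |t_i(x)|^{m_i}\, e^{(\theta + \varepsilon a_j)^\top t(x)} \]

as each term in the sum is integrable by Theorem 3.8.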
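
The key step in this proof is the bound $e^{h^\top t} \le \sum_j e^{\varepsilon a_j^\top t}$ for $h$ in the cube $C_\varepsilon$: it holds because $h^\top t \le \varepsilon \sum_i |t_i|$, and the right-hand side equals $\varepsilon a_j^\top t$ for the corner whose signs match those of $t$. The Python sketch below checks the bound on random inputs; the dimension, sampling ranges and function names are assumptions chosen for the illustration.

```python
import itertools
import math
import random


def corner_sum(t, eps):
    # Sum over the 2^k sign vectors a_j of exp(eps * a_j' t).
    return sum(
        math.exp(eps * sum(s * ti for s, ti in zip(signs, t)))
        for signs in itertools.product((-1.0, 1.0), repeat=len(t))
    )


def check_bound(k=3, eps=0.5, trials=1000, seed=1):
    # Returns True if exp(h' t) <= corner_sum(t, eps) in every trial.
    rng = random.Random(seed)
    for _ in range(trials):
        t = [rng.uniform(-5.0, 5.0) for _ in range(k)]
        h = [rng.uniform(-eps, eps) for _ in range(k)]  # h lies in the cube C_eps
        lhs = math.exp(sum(hi * ti for hi, ti in zip(h, t)))
        if lhs > corner_sum(t, eps):
            return False
    return True


print(check_bound())
```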


B.1.2 Quadratic regularity of exponential families


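Proof of Proposition 4.5. Curved exponential families are smooth and locally
stable, so we only have to establish the additional regularity condition (4.3). We have

\[ f_\beta(x) = e^{\varphi(\beta)^\top t(x) - \psi(\varphi(\beta))}. \]

Taylor’s formula with Lagrange’s remainder term (Theorem A.23) yields

\[ f_\gamma(x) - f_\beta(x) = f_{\gamma^*}(x)\, (\gamma - \beta)^\top J(\gamma^*)^\top \left( t(x) - \tau(\varphi(\gamma^*)) \right) \]

where $\gamma^*$ is between $\gamma$ and $\beta$. Thus if we let $\theta^* = \varphi(\gamma^*)$, $\theta = \varphi(\beta)$, and $\delta = \theta^* - \theta$
we have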


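\[ |f_\gamma(x) - f_\beta(x)| \le f_\beta(x)\, \|\gamma - \beta\|\, \|J(\gamma^*)^\top (t(x) - \tau(\theta^*))\|\, e^{\delta^\top t(x) - \psi(\theta^*) + \psi(\theta)}. \]

Using the same argument as in the proof of Lemma B.2, we have for $\|\gamma - \beta\|$
sufficiently small so $\delta \in C_\varepsilon$ that

\[ \|J(\gamma^*)^\top (t(x) - \tau(\theta^*))\|\, e^{\delta^\top t(x) - \psi(\theta^*) + \psi(\theta)} \le K_\theta\, \|t(x)\| \sum_{j=1}^{2^k} e^{\varepsilon a_j^\top t(x)} = H_\theta(x) \]

for some constant $K_\theta$ and if $\varepsilon$ is chosen sufficiently small, we have $\theta + 2\varepsilon a_j \in \Theta$
and therefore

\[ E_\theta\{H_\theta(X)^2\} < \infty, \]

as desired.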


B.2 Asymptotic existence of the MLE


Here we shall provide a proof of Lemma 5.25, stating that the MLE in a curved 
exponential family is asymptotically well-defined and given as a smooth function of 
the average canonical statistic. 


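Proof of Lemma 5.25. We fix an element $\beta_0 \in B$, let $\theta_0 = \varphi(\beta_0)$, and $\eta_0 =
\tau(\varphi(\beta_0)) \in \tau(\varphi(B))$. We also introduce the function

\[ \lambda(\eta, \theta) = \theta^\top \eta - \psi(\theta) \]

and note that then $\tilde\lambda(\eta, \beta) = \lambda(\eta, \varphi(\beta))$.

For $\eta = \eta_0$ we know from the regular case that $\lambda(\eta_0, \theta)$ has a unique maximum
over $\theta \in \Theta$ for $\theta = \tau^{-1}(\eta_0) = \varphi(\beta_0)$, implying that also $\tilde\lambda(\eta_0, \beta)$ has a unique
maximum in $\beta$ for $\beta = \beta_0$; thus we have $g(\eta_0) = g(\tau(\varphi(\beta_0))) = \beta_0$ which is (5.14).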

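Next choose an $\varepsilon$ so that $C_\varepsilon = \{\beta \mid \|\beta - \beta_0\| \le \varepsilon\} \subseteq B$; this is possible because
$B$ is an open set. Because $\varphi$ is a homeomorphism, $\varphi^{-1}$ is continuous and therefore
there is a $\delta$ such that

\[ \|\beta - \beta_0\| > \varepsilon \implies \|\varphi(\beta) - \varphi(\beta_0)\| > \delta. \tag{B.2} \]

Since also $\Theta$ is open, we may assume that $\{\theta \mid \|\theta - \theta_0\| \le \delta\} \subseteq \Theta$ by choosing $\delta$
sufficiently small.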

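We note that if $\eta \to \eta_0$, then the functions $\lambda(\eta, \cdot)$ converge uniformly to $\lambda(\eta_0, \cdot)$
on bounded sets since

\[ |\lambda(\eta, \theta) - \lambda(\eta_0, \theta)| = |\theta^\top (\eta - \eta_0)| \le \|\theta\|\, \|\eta - \eta_0\|. \]

This holds in particular on the sphere $S_\delta = \{\theta \mid \|\theta - \theta_0\| = \delta\}$ and $S_\delta \cup \{\theta_0\}$. Since

\[ \sup_{\theta \in S_\delta} \lambda(\eta_0, \theta) < \lambda(\eta_0, \theta_0) = \tilde\lambda(\eta_0, \beta_0), \]

there is therefore a neighbourhood $U_{\eta_0}$ of $\eta_0$ so that for all $\eta \in U_{\eta_0}$ it also holds that

\[ \sup_{\theta \in S_\delta} \lambda(\eta, \theta) < \lambda(\eta, \theta_0) = \tilde\lambda(\eta, \beta_0). \]

As $\psi$ is strictly convex, $\lambda(\eta, \cdot)$ is strictly concave for any $\eta$, so we may strengthen the
conclusion also to hold for all values outside that sphere:

\[ \sup_{\{\theta \,\mid\, \|\theta - \theta_0\| \ge \delta\}} \lambda(\eta, \theta) < \tilde\lambda(\eta, \beta_0); \tag{B.3} \]

for if there were a $\theta$ outside this sphere with $\lambda(\eta, \theta) \ge \tilde\lambda(\eta, \beta_0)$ and $\theta^* \in S_\delta$ denotes
the point on the line segment between $\theta$ and $\theta_0$, we would have

\[ \lambda(\eta, \theta^*) < \lambda(\eta, \theta) \quad \text{and} \quad \lambda(\eta, \theta^*) < \tilde\lambda(\eta, \beta_0) \]

contradicting that $\lambda(\eta, \cdot)$ is concave.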

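The continuous function $\tilde\lambda(\eta, \cdot) = \lambda(\eta, \varphi(\cdot))$ must attain its maximum over the compact set
$C_\varepsilon$. From (B.2) and (B.3) we conclude that $\tilde\lambda(\eta, \beta) < \tilde\lambda(\eta, \beta_0)$ for all $\beta \notin C_\varepsilon$ and
thus the maximum over $C_\varepsilon$ is actually a global maximum. Since the maximizer is
an interior point in $B$, it must be a stationary point and is thus implicitly defined by the
equation $S(\eta, \beta) = 0$ where

\[ S(\eta, \beta) = \frac{\partial}{\partial \beta} \tilde\lambda(\eta, \beta) = (\eta - \tau(\varphi(\beta)))^\top J(\beta). \]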


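We now wish to use the implicit function theorem (Theorem A.21) on this equation
and get for the partial derivatives with respect to $\beta$

\[
A(\eta, \beta)_{ij} = \frac{\partial S(\eta, \beta)_i}{\partial \beta_j}
= \sum_{u=1}^{k} \frac{\partial^2 \varphi_u(\beta)}{\partial \beta_i\, \partial \beta_j} \left( \eta_u - \tau_u(\varphi(\beta)) \right)
- \sum_{u=1}^{k} \sum_{v=1}^{k} \frac{\partial \varphi_u(\beta)}{\partial \beta_i}\, \kappa_{uv}(\varphi(\beta))\, \frac{\partial \varphi_v(\beta)}{\partial \beta_j}
\]

or in a more compact form

\[ A(\eta, \beta) = \sum_{u} D^2 \varphi(\beta)_u \left( \eta_u - \tau_u(\varphi(\beta)) \right) - J(\beta)^\top \kappa(\varphi(\beta))\, J(\beta). \tag{B.4} \]

For $\eta = \tau(\varphi(\beta))$ the first term in (B.4) is zero and thus if $\eta_0 = \tau(\varphi(\beta_0))$ we get

\[ A(\eta_0, \beta_0) = -J(\beta_0)^\top \kappa(\varphi(\beta_0))\, J(\beta_0) = -i(\beta_0). \tag{B.5} \]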


Since $\kappa$ is positive definite and $J(\beta)$ has full rank we have that if, for some vector $\lambda$,

\[ 0 = -\lambda^\top A(\eta_0, \beta_0)\, \lambda = (J(\beta_0)\lambda)^\top \kappa(\varphi(\beta_0))\, J(\beta_0)\lambda \]

we must have $J(\beta_0)\lambda = 0$ and thus $\lambda = 0$; hence $-A(\eta_0, \beta_0)$ is positive definite. Therefore,
$\det A(\eta_0, \beta_0) \ne 0$ and Theorem A.21 applies. We conclude that there is a neighbourhood
$U'_{\eta_0} \subseteq U_{\eta_0}$ where $g$ is well-defined as a smooth function of $\eta$. Further, we
have

\[ \frac{\partial S(\eta, \beta)^\top}{\partial \eta} = J(\beta)^\top. \tag{B.6} \]

Now implicit differentiation in combination with (B.5) and (B.6) yields that the function
$g$ has Jacobian equal to

\[ Dg(\eta) = i(g(\eta))^{-1} J(g(\eta))^\top. \]

We have now established the conclusions of the lemma in a neighbourhood $U'_{\eta_0}$
of an arbitrary $\eta_0 \in \tau(\varphi(B))$. We finally let $O = \bigcup_{\eta_0 \in \tau(\varphi(B))} U'_{\eta_0}$ and the proof is
complete.


B.3 Iterative proportional scaling 


In this section we establish convergence of the IPS algorithm for two-way classific- 
ations in contingency tables, as described in Section 8.3.3.2. 

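Let $\mathcal{I} = \{0, \ldots, r - 1\}$, $\mathcal{J} = \{0, \ldots, s - 1\}$, and let $B = \{B_{ij}, i \in \mathcal{I}, j \in \mathcal{J}\}$
be an array of positive numbers $B_{ij} > 0$. Further, let $M = M(B)$ be the set of
arrays with positive entries that have the same cross-product ratios as $B$, i.e. that for
all $i, i^* \in \mathcal{I}$, $j, j^* \in \mathcal{J}$ satisfy

\[ \frac{m_{ij}\, m_{i^*j^*}}{m_{ij^*}\, m_{i^*j}} = \frac{B_{ij}\, B_{i^*j^*}}{B_{ij^*}\, B_{i^*j}}. \tag{B.7} \]

Lemma B.3. A table of means $\lambda = \{\lambda_{ij}\}$ is in the shifted multiplicative Poisson
family determined by $B$ if and only if $\lambda \in M(B)$.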


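Proof. If $\lambda_{ij} = \mu \rho_i \eta_j B_{ij}$ we get

\[
\frac{\lambda_{ij}\, \lambda_{i^*j^*}}{\lambda_{ij^*}\, \lambda_{i^*j}}
= \frac{\mu \rho_i \eta_j B_{ij}\; \mu \rho_{i^*} \eta_{j^*} B_{i^*j^*}}{\mu \rho_i \eta_{j^*} B_{ij^*}\; \mu \rho_{i^*} \eta_j B_{i^*j}}
= \frac{B_{ij}\, B_{i^*j^*}}{B_{ij^*}\, B_{i^*j}}
\]

and hence $\lambda \in M(B)$.

Conversely, if $\lambda \in M(B)$, we may choose $i^* = j^* = 0$ and thus

\[
\lambda_{ij} = \frac{\lambda_{i0}\, \lambda_{0j}}{\lambda_{00}} \cdot \frac{B_{ij}\, B_{00}}{B_{i0}\, B_{0j}}
= \frac{\lambda_{00}}{B_{00}} \cdot \frac{\lambda_{i0} B_{00}}{\lambda_{00} B_{i0}} \cdot \frac{\lambda_{0j} B_{00}}{\lambda_{00} B_{0j}} \cdot B_{ij}
= \mu \rho_i \eta_j B_{ij}
\]

where

\[ \mu = \frac{\lambda_{00}}{B_{00}}, \quad \rho_i = \frac{\lambda_{i0} B_{00}}{\lambda_{00} B_{i0}}, \quad \eta_j = \frac{\lambda_{0j} B_{00}}{\lambda_{00} B_{0j}} \]

are all positive and $\rho_0 = \eta_0 = 1$. This completes the proof.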


Thus we may parametrize the shifted multiplicative Poisson model by $M(B)$.
Consider now the marginal totals from a two-way contingency table

\[ y_{i0} + \cdots + y_{i,s-1} = y_{i+}, \quad y_{0j} + \cdots + y_{r-1,j} = y_{+j}, \quad y_{+0} + \cdots + y_{+,s-1} = y_{++} \]

and the log-likelihood for $m \in M(B)$:

\[ \ell_y(m) = \sum_{i=0}^{r-1} \sum_{j=0}^{s-1} \left( y_{ij} \log m_{ij} - m_{ij} \right). \tag{B.8} \]

We note that if $\hat m$ satisfies the likelihood equations (8.13), we have $\hat m_{++} = y_{++}$ so
if we define

\[ M_y(B) = \{m \in M(B) : m_{++} = y_{++}\}, \]

we may as well maximize $\ell_y$ over $M_y(B)$ rather than $M(B)$. If we let

\[ \log m_{ij} = \gamma + \alpha_i + \beta_j + \log B_{ij} \]

for $m \in M_y(B)$, (B.8) simplifies to

\[ \ell_y(m) = \sum_{i=0}^{r-1} \alpha_i y_{i+} + \sum_{j=0}^{s-1} \beta_j y_{+j} + \sum_{i=0}^{r-1} \sum_{j=0}^{s-1} y_{ij} \log B_{ij} + y_{++}(\gamma - 1). \]

We then have an important lemma. 


Lemma B.4. If all marginal totals of $y$ are positive and $m^n$, $n \in \mathbb{N}$ is a sequence of
elements in $M_y(B)$ so that $m^n_{ij} \to 0$ for $n \to \infty$, then $\ell_y(m^n) \to -\infty$ for $n \to \infty$.


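Proof. Consider the array $x$ given as $x_{ij} = y_{i+} y_{+j} / y_{++}$. Then all entries of $x$ are
positive and $x$ has the same marginal totals as $y$:

\[ x_{i+} = y_{i+},\; i \in \mathcal{I}, \quad x_{+j} = y_{+j},\; j \in \mathcal{J}, \quad x_{++} = y_{++}. \]

Thus the formal Poisson likelihood based on $x$ is

\[
\begin{aligned}
\ell_x(m) &= \sum_{i=0}^{r-1} \alpha_i y_{i+} + \sum_{j=0}^{s-1} \beta_j y_{+j} + \sum_{i=0}^{r-1} \sum_{j=0}^{s-1} x_{ij} \log B_{ij} + x_{++}(\gamma - 1) \\
&= \ell_y(m) + \sum_{i=0}^{r-1} \sum_{j=0}^{s-1} (x_{ij} - y_{ij}) \log B_{ij} = \ell_y(m) + \text{constant}.
\end{aligned}
\]

Thus we have for $m^n \in M_y(B)$

\[ \ell_y(m^n) = \sum_{i=0}^{r-1} \sum_{j=0}^{s-1} y_{ij} \log m^n_{ij} + c_1 = \sum_{i=0}^{r-1} \sum_{j=0}^{s-1} x_{ij} \log m^n_{ij} + c_2 \]

where $c_1$ and $c_2$ are constants. But since $x_{ij} > 0$ for all $ij$ and entries of $m$ are
bounded above in $M_y(B)$, the last expression clearly tends to $-\infty$ for $m^n_{ij} \to 0$, as
desired.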


Next, we let $m^0 \in M_y(B)$ be determined as

\[ m^0_{ij} = \frac{y_{++}}{B_{++}}\, B_{ij} \]

and define

\[ M_y^0(B) = \{m \in M(B) : m_{++} = y_{++} \text{ and } \ell_y(m) \ge \ell_y(m^0)\}. \]

We may then show


Corollary B.5. If $y$ has all marginal totals positive, the MLE of $m$ based on $y$ exists.


Proof. Any maximizer of $\ell_y$ over $M_y(B)$ is also a maximizer over $M_y^0(B)$ and vice
versa. By Lemma B.4, the latter is closed and bounded, hence compact, and since $\ell_y$
is continuous in $m$, it attains its maximum over $M_y^0(B)$, hence over $M_y(B)$ and
$M(B)$.


We now define the proportional scaling operations $T_R$ and $T_C$ on $M(B)$ as

\[ T_R(m)_{ij} = m_{ij}\, \frac{y_{i+}}{m_{i+}}, \qquad T_C(m)_{ij} = m_{ij}\, \frac{y_{+j}}{m_{+j}}, \qquad i \in \mathcal{I},\ j \in \mathcal{J}, \]

and note the following


Lemma B.6. The scaling operations are continuous operations on $M_y^0(B)$ and satisfy

\[ T_R(M_y(B)) \subseteq M_y(B), \quad T_C(M_y(B)) \subseteq M_y(B). \]

Further, they adjust the marginals to satisfy

\[ T_R(m)_{i+} = y_{i+},\; i \in \mathcal{I}; \quad T_C(m)_{+j} = y_{+j},\; j \in \mathcal{J}. \]

In addition, it holds for all $m \in M_y^0(B)$ that $\ell_y(T_R(m)) \ge \ell_y(m)$ with equality if
and only if $m_{i+} = y_{i+}$, $i \in \mathcal{I}$, and similarly $\ell_y(T_C(m)) \ge \ell_y(m)$ with equality if
and only if $m_{+j} = y_{+j}$, $j \in \mathcal{J}$.

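Proof. The operations are obviously continuous. To see that $T_R$ preserves cross-product
ratios, we let $\gamma_i = y_{i+}/m_{i+}$ and get

\[
\frac{T_R(m)_{ij}\, T_R(m)_{i^*j^*}}{T_R(m)_{ij^*}\, T_R(m)_{i^*j}}
= \frac{m_{ij} \gamma_i\; m_{i^*j^*} \gamma_{i^*}}{m_{ij^*} \gamma_i\; m_{i^*j} \gamma_{i^*}}
= \frac{m_{ij}\, m_{i^*j^*}}{m_{ij^*}\, m_{i^*j}}
= \frac{B_{ij}\, B_{i^*j^*}}{B_{ij^*}\, B_{i^*j}}
\]

and similarly for $T_C$.

Next we have

\[ T_R(m)_{i+} = \sum_{j=0}^{s-1} m_{ij}\, \frac{y_{i+}}{m_{i+}} = m_{i+}\, \frac{y_{i+}}{m_{i+}} = y_{i+} \]

and similarly for $T_C$, showing that marginals are adjusted, implying also that
$T_R(m)_{++} = T_C(m)_{++} = y_{++}$ so we have shown that $T_R(M_y(B)) \subseteq M_y(B)$
and similarly for $T_C$.

Now consider the change in likelihood after updating the rows, say. Using that
$T_R(m)_{++} = m_{++} = y_{++}$, we get

\[
\begin{aligned}
\ell_y(T_R(m)) &= \sum_{i=0}^{r-1} \sum_{j=0}^{s-1} y_{ij} \left( \log m_{ij} + \log y_{i+} - \log m_{i+} \right) - y_{++} \\
&= \ell_y(m) + \sum_{i=0}^{r-1} y_{i+} \left( \log y_{i+} - \log m_{i+} \right).
\end{aligned}
\]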

The information inequality in Lemma A.25 therefore implies

\[ \ell_y(T_R(m)) \ge \ell_y(m) \tag{B.9} \]

with equality if and only if $y_{i+} = m_{i+}$ for all $i \in \mathcal{I}$. Similarly we get for all $m \in
M(B)$ with $m_{++} = y_{++}$ that

\[ \ell_y(T_C(m)) \ge \ell_y(m) \tag{B.10} \]

and the inequality is strict unless $y_{+j} = m_{+j}$ for all $j \in \mathcal{J}$. Since the likelihood
does not decrease when we scale, by (B.9) and (B.10), it follows from the inclusions
shown above that we also have $T_R(M_y^0(B)) \subseteq M_y^0(B)$ and $T_C(M_y^0(B)) \subseteq M_y^0(B)$ and
the proof is complete.


We next define the iteration

\[ m^{n+1} = T_R(T_C(m^n)), \quad n = 0, 1, 2, \ldots \]

and may now show:


Theorem B.7. The sequence $m^n$ of arrays obtained by iterative proportional scaling
converges to the maximum likelihood estimate $\hat m$ of the mean in the shifted
multiplicative Poisson model:

\[ \lim_{n \to \infty} m^n = \hat m. \]

Proof. The set $M_y^0(B)$ is compact so we just need to show that every convergent
subsequence of the sequence above has $\hat m$ as limit. So let $m^{n_k}$, $k = 1, 2, \ldots$ denote
a convergent subsequence and let $m^*$ be its limit. Since the log-likelihood increases
at every scaling, we have that $\ell_y(m^n)$ is non-decreasing in $n$ and thus

\[ \ell_y(T_R(T_C(m^*))) = \lim_{k \to \infty} \ell_y(T_R(T_C(m^{n_k}))) \le \lim_{k \to \infty} \ell_y(m^{n_{k+1}}) = \ell_y(m^*). \]

But since also

\[ \ell_y(T_R(T_C(m^*))) \ge \ell_y(T_C(m^*)) \ge \ell_y(m^*) \]

we must have equality and thus

\[ T_R(T_C(m^*)) = T_C(m^*), \quad T_C(m^*) = m^*, \quad T_R(m^*) = m^*, \]

implying that $m^*$ satisfies the likelihood equations

\[ m^*_{i+} = y_{i+},\; i \in \mathcal{I}, \quad m^*_{+j} = y_{+j},\; j \in \mathcal{J} \]

and hence is the unique MLE.
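
The iteration analysed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: tables are stored as lists of rows, the stopping rule (`tol`, `max_iter`) is an arbitrary choice that is not part of the text, and all function names are hypothetical.

```python
import math


def row_scale(m, y):
    # T_R: multiply row i of m by y_{i+} / m_{i+}.
    return [[v * sum(y_row) / sum(m_row) for v in m_row]
            for m_row, y_row in zip(m, y)]


def col_scale(m, y):
    # T_C: multiply column j of m by y_{+j} / m_{+j}.
    ncols = len(m[0])
    m_tot = [sum(row[j] for row in m) for j in range(ncols)]
    y_tot = [sum(row[j] for row in y) for j in range(ncols)]
    return [[row[j] * y_tot[j] / m_tot[j] for j in range(ncols)] for row in m]


def log_lik(m, y):
    # Poisson log-likelihood as in (B.8).
    return sum(yv * math.log(mv) - mv
               for m_row, y_row in zip(m, y)
               for mv, yv in zip(m_row, y_row))


def initial_table(b, y):
    # Starting value m^0 = (y_{++} / B_{++}) B.
    scale = sum(map(sum, y)) / sum(map(sum, b))
    return [[scale * v for v in row] for row in b]


def ips(b, y, tol=1e-12, max_iter=1000):
    # Iterate m^{n+1} = T_R(T_C(m^n)) until successive tables agree within tol
    # (assumed stopping rule) or max_iter sweeps have been made.
    m = initial_table(b, y)
    for _ in range(max_iter):
        new = row_scale(col_scale(m, y), y)
        change = max(abs(u - v)
                     for r_new, r_old in zip(new, m)
                     for u, v in zip(r_new, r_old))
        m = new
        if change < tol:
            break
    return m


# Hypothetical 2 x 3 example with positive counts y and positive array B.
y = [[10.0, 20.0, 5.0], [6.0, 4.0, 15.0]]
b = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0]]
m_hat = ips(b, y)
print(m_hat)
```

The fitted table should reproduce both sets of marginal totals of `y` while keeping the cross-product ratios of `b`, which is exactly the characterization of the MLE used in the proof.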